kaknikhil opened a new pull request #449: DL: Use REAL[] instead of anyarray for aggregating arrays
URL: https://github.com/apache/madlib/pull/449

JIRA: MADLIB-1334

Using real[] instead of anyarray for agg_array_concat solves the scaling problem, i.e. running the input preprocessor on the same dataset with different buffer sizes now results in comparable runtimes. This is because the plan for the aggregate with real[] aggregates on the segments and then gathers on the master, whereas the plan for the aggregate with anyarray first gathers all the data on the master and then runs the aggregate function on the master.

Here are the results from my local Mac with gpdb 5.21 and 3 segments:

```
mnist 10k rows anyarray agg
madlib=# select madlib.training_preprocessor_dl('mnist_train_10k', 'mnist_batch_10k_anyarray', 'y', 'x');
Time: 109790.096 ms

mnist 10k rows real_array_agg
madlib=# select madlib.training_preprocessor_dl('mnist_train_10k', 'mnist_batch_10k_real_array', 'y', 'x');
Time: 14086.535 ms

mnist 60k rows with anyarray agg
madlib=# select madlib.training_preprocessor_dl('mnist_train', 'mnist_batch_anyarray', 'y', 'x');
Had to cancel after 55 mins

mnist 60k rows with real_array agg
madlib=# select madlib.training_preprocessor_dl('mnist_train', 'mnist_batch_real_array', 'y', 'x');
Time: 580822.559 ms
```

Co-authored-by: Domino Valdano <[email protected]>
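
For readers unfamiliar with the two plan shapes described above, here is a rough Python sketch of the difference. It is only a toy model: it is not MADlib code and does not reproduce the Greenplum planner, and the function names (`gather_then_aggregate`, `aggregate_then_gather`) and the nested-list representation of segments are assumptions made for illustration. It counts how many aggregation steps end up on the master under each plan.

```python
from typing import List, Tuple

Row = List[float]          # one row's array value
Segment = List[Row]        # the rows stored on one segment (hypothetical model)


def gather_then_aggregate(segments: List[Segment]) -> Tuple[List[float], int]:
    """Model of the anyarray-style plan: ship every row to the master,
    then run the concatenating aggregate there."""
    state: List[float] = []
    master_steps = 0
    for seg in segments:
        for row in seg:
            state = state + row      # each row is concatenated on the master
            master_steps += 1
    return state, master_steps


def aggregate_then_gather(segments: List[Segment]) -> Tuple[List[float], int]:
    """Model of the real[]-style plan: each segment builds a partial result,
    and the master only merges one partial per segment."""
    partials: List[List[float]] = []
    for seg in segments:
        partial: List[float] = []
        for row in seg:
            partial = partial + row  # this work happens on the segment
        partials.append(partial)

    state: List[float] = []
    master_steps = 0
    for partial in partials:
        state = state + partial      # master merges per-segment partials
        master_steps += 1
    return state, master_steps


if __name__ == "__main__":
    # Three segments holding five rows in total (made-up data).
    segments = [[[1.0, 2.0], [3.0]], [[4.0]], [[5.0], [6.0, 7.0]]]
    print(gather_then_aggregate(segments))
    print(aggregate_then_gather(segments))
```

Both functions return the same concatenated array; what differs is that the master's step count grows with the number of rows in the first model and with the number of segments in the second.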
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
