kaknikhil opened a new pull request #449: DL: Use REAL[] instead of anyarray for aggregating arrays
URL: https://github.com/apache/madlib/pull/449
 
 
   JIRA: MADLIB-1334
   
   Using real[] instead of anyarray for agg_array_concat solves the scaling
   problem, i.e. running the input preprocessor on the same dataset with
   different buffer sizes now results in comparable runtimes. This is because
   the plan for the aggregate with real[] aggregates on the segments and then
   gathers on the master, whereas the plan for the aggregate with anyarray
   first gathers all the data on the master and then runs the aggregate
   function there.
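   
   One way to see the difference in plans is to compare EXPLAIN output for
   the two aggregate signatures. The following is only a minimal sketch, not
   the code in this PR: demo_table, demo_concat_any and demo_concat_real are
   hypothetical names, and it assumes GPDB 5 syntax, where PREFUNC names the
   function that combines per-segment states. The plans you get may vary
   with the GPDB version and optimizer settings.
   
   ```sql
   -- Hypothetical demo table; not part of MADlib or of this PR.
   CREATE TABLE demo_table (id INT, x REAL[]) DISTRIBUTED BY (id);
   INSERT INTO demo_table
   SELECT i, ARRAY[i, i + 1]::REAL[] FROM generate_series(1, 1000) AS i;
   
   -- Polymorphic variant (hypothetical name): state type is anyarray.
   CREATE AGGREGATE demo_concat_any(anyarray) (
       SFUNC = array_cat,
       PREFUNC = array_cat,
       STYPE = anyarray
   );
   
   -- Concrete variant (hypothetical name): state type is REAL[].
   CREATE AGGREGATE demo_concat_real(REAL[]) (
       SFUNC = array_cat,
       PREFUNC = array_cat,
       STYPE = REAL[]
   );
   
   -- Compare where the Aggregate node sits relative to the Gather Motion.
   EXPLAIN SELECT demo_concat_any(x) FROM demo_table;
   EXPLAIN SELECT demo_concat_real(x) FROM demo_table;
   ```
   
   If the behaviour described above reproduces, the real[] plan should show
   an aggregate step below the Gather Motion (run on the segments), while the
   anyarray plan should show the Gather Motion feeding a single aggregate on
   the master.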
   
   Here are the results from my local Mac running GPDB 5.21 with 3 segments:
   
   ```
   mnist 10k rows with anyarray agg
   madlib=# select madlib.training_preprocessor_dl('mnist_train_10k', 'mnist_batch_10k_anyarray', 'y', 'x');
   Time: 109790.096 ms
   
   mnist 10k rows with real_array agg
   madlib=# select madlib.training_preprocessor_dl('mnist_train_10k', 'mnist_batch_10k_real_array', 'y', 'x');
   Time: 14086.535 ms
   
   mnist 60k rows with anyarray agg
   madlib=# select madlib.training_preprocessor_dl('mnist_train', 'mnist_batch_anyarray', 'y', 'x');
   Had to cancel after 55 mins
   
   mnist 60k rows with real_array agg
   madlib=# select madlib.training_preprocessor_dl('mnist_train', 'mnist_batch_real_array', 'y', 'x');
   Time: 580822.559 ms
   ```
   
   Co-authored-by: Domino Valdano <[email protected]>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
