reductionista edited a comment on issue #467: DL: Improve performance of 
mini-batch preprocessor
URL: https://github.com/apache/madlib/pull/467#issuecomment-573278611
 
 
   > @reductionista Looking at the 5k buffer size runs, can you pls just double 
check in the code that we are skipping the normalization if the normalization 
factor is `1.0` or `NULL` ? I am asking since there is no performance 
improvement on my small test cluster when I skip normalization.
   
   I turned on debugging for a small 3-segment cluster and verified that 
`scalar_array_mult()` is not called for the `NULL/1.0` cases.
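
   The guard being verified can be sketched roughly as below. This is a hedged illustration in NumPy, not the actual MADlib implementation; the helper name `normalize_if_needed` is hypothetical, standing in for the code path that decides whether to call `scalar_array_mult()`:

   ```python
   import numpy as np

   def normalize_if_needed(x, normalizing_const):
       # Hypothetical guard: a NULL (None) or 1.0 normalization factor makes
       # the multiply a no-op, so skip the scalar_array_mult()-style call
       # entirely and return the input array untouched.
       if normalizing_const is None or normalizing_const == 1.0:
           return x
       # Otherwise scale every element by 1/normalizing_const.
       return x * (1.0 / np.float32(normalizing_const))

   pixels = np.array([0, 128, 255], dtype=np.int32)

   # NULL / 1.0 cases: the same array object comes back, no multiply performed.
   assert normalize_if_needed(pixels, None) is pixels
   assert normalize_if_needed(pixels, 1.0) is pixels

   # 256.0 case: elementwise scaling happens (and yields REAL/float values).
   scaled = normalize_if_needed(pixels, 256.0)
   ```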
   
   Breaking down the timings of the individual queries, the 4 main stages for 
the 5k buffer size with NULL normalization are:
   ```
   1-hot encoding:      2.5s
   batching:          422s
   redistribution:      1.5s
   bytea conversion:   35s
   ```
   
   For 5k buffer size with 256.0 normalization they look like:
   ```
   1-hot encoding + normalization:  44s
   batching:  445s
   redistribution:  1.7s
   bytea conversion:  0.1s
   ```
   
   The only odd thing I notice here is that the bytea conversion took much 
longer when the normalization was skipped.  I think this means some time is 
spent casting the INTEGER type to REAL, and that cast has to happen either 
during normalization or later during the bytea conversion.  I don't think we 
had realized that before, and it seems to make the time savings for this case 
almost negligible.
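
   A rough sketch of why the cast cost just moves between stages, assuming the preprocessor ultimately serializes REAL (float32) arrays to bytea (the variable names here are illustrative, not MADlib's):

   ```python
   import numpy as np

   raw = np.arange(10, dtype=np.int32)  # e.g. 1-hot encoded INTEGER values

   # Path A: normalization present. The elementwise divide already produces
   # REAL values, so the INTEGER->REAL cast cost is paid in this stage and
   # the later bytea serialization is a cheap copy of float32 data.
   normalized = (raw / np.float32(256.0)).astype(np.float32)
   bytea_a = normalized.tobytes()

   # Path B: normalization skipped. The data is still INTEGER at serialization
   # time, so the cast to REAL has to happen during the bytea conversion
   # instead -- the work doesn't disappear, it just shows up in a later timing.
   bytea_b = raw.astype(np.float32).tobytes()

   # Either way the serialized payload is float32: 4 bytes per element.
   assert len(bytea_a) == len(bytea_b) == raw.size * 4
   ```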

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services