clintropolis commented on issue #6016: Druid 'Shapeshifting' Columns
URL: https://github.com/apache/incubator-druid/pull/6016#issuecomment-407572959
 
 
   @himanshug Thanks for having a look!
   
   >First "Time to select rows" benchmark appears to have a peak just before 3M 
. Assuming that peak is at x M, it says performance would be better when 
selecting (x+delta) rows instead of x rows. I'm interested if there is an 
explanation for it. That peak shows up at different points in x-axis very 
consistently in all similar graphs.
   
   My bad, I should've done a better job of calling out that the last data point is special on the 'time to select n rows' graphs: it represents the raw speed of a total scan, whereas the rest of the data points are collected using a `BitSet` to simulate a filter. The code in the benchmark probably explains it best:
    ```
        IndexedInts encoder = encoders.get(encoding);
        if (filter == null) {
          // no filter: total scan, read every row
          for (int i = 0; i < rows; i++) {
            blackhole.consume(encoder.get(i));
          }
        } else {
          // filter: only read rows whose bits are set in the BitSet
          for (int i = filter.nextSetBit(0); i >= 0; i = filter.nextSetBit(i + 1)) {
            blackhole.consume(encoder.get(i));
          }
        }
    ```
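   
   For context, the filter itself is just a `BitSet` with n bits set. Below is a simplified sketch of how such a filter could be built; it is illustrative only, not the actual benchmark setup code:
    ```
        import java.util.BitSet;
        import java.util.Random;

        // Simplified sketch (not the actual benchmark setup): build a BitSet with
        // 'rowsToSelect' bits set to simulate a filter that selects that many rows.
        // Assumes rowsToSelect <= numRows.
        static BitSet makeFilter(int numRows, int rowsToSelect, long seed)
        {
          Random random = new Random(seed);
          BitSet filter = new BitSet(numRows);
          int selected = 0;
          while (selected < rowsToSelect) {
            int row = random.nextInt(numRows);
            if (!filter.get(row)) {
              filter.set(row);
              selected++;
            }
          }
          return filter;
        }
    ```
   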
   Data visualization is hard 😄 Is it worth reworking the plots so this data point isn't part of the same line? Or perhaps showing it as a separate 'total scan' line connected to a new 'time to select 1 row' data point? I also considered using a filter of `numRows - 1` rows to show it as the 'cliff' it is, instead of producing what appears to be a peak.
   
   > In "The bad" section , it seems ShapeShiftingColumn will outperform 
current impl in all cases, if they could use blocks of varying sizes. That 
sounds great and deserves ateast validating that. If true, then I think it is 
well worth it even with a little bit of extra heap required given that this 
feature already requires re-tuning heap and maybe other jvm params.
   
   It's less the varying block sizes that could achieve this than simply having a block size equivalent (in terms of how many values it can hold) to the largest size `CompressedVSizeColumnarIntsSerializer` can produce, which is 2^16 values when every value can be represented with a single byte. Heap usage to support this block size in shape-shift is ~4x higher than with the current largest block size it offers (2^14 values), but I agree it is probably worth investigating.
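   
   To make the ~4x figure concrete, the rough arithmetic looks like this, assuming each decoded block is buffered as an `int[]` on the heap (the actual per-block overhead may differ):
    ```
        // Back-of-envelope heap cost per decode buffer, assuming one int[] per block
        int currentBlock = 1 << 14;   // 2^14 values, current largest shape-shift block size
        int proposedBlock = 1 << 16;  // 2^16 values, largest CompressedVSizeColumnarIntsSerializer block
        long currentBytes = (long) currentBlock * Integer.BYTES;   // 64 KiB
        long proposedBytes = (long) proposedBlock * Integer.BYTES; // 256 KiB, i.e. ~4x more heap
    ```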
   
   > Does new design to read data increase heap requirements, if yes then that would not be end of the world but deserves mention in the "bad" section (so re-tuning at historicals as well). 
   
   Yes, on the query side heap usage comes in the form of primitive arrays that are pooled in the same manner as the direct bytebuffers used for decompressing values. I will edit the summary to clarify that it introduces heap usage on both the indexing and query sides.
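   
   As a rough illustration of what that pooling looks like (a minimal sketch only; the array pool in the PR follows Druid's existing direct `ByteBuffer` pooling, and `decodedValuesPool` is just a hypothetical name here):
    ```
        import java.util.ArrayDeque;

        // Minimal sketch: reuse primitive int[] decode buffers across reads the same
        // way pooled direct ByteBuffers are reused for decompression.
        final ArrayDeque<int[]> decodedValuesPool = new ArrayDeque<>();

        int[] take(int blockSize)
        {
          int[] buffer = decodedValuesPool.poll();
          return buffer != null ? buffer : new int[blockSize];
        }

        void giveBack(int[] buffer)
        {
          decodedValuesPool.offer(buffer); // return the buffer for reuse instead of reallocating
        }
    ```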
   
   > Also new read design introduces mutation and I hope this doesn't mean 
requiring locking etc in the face of concurrent segment read or else that might 
cause more problems than being solved.
   
   I believe this should all be contained within a single thread, so locking 
shouldn't be necessary.
