ggershinsky commented on pull request #32473:
URL: https://github.com/apache/spark/pull/32473#issuecomment-838117882


   Thanks @huaxingao . I think my basic question is about the extent to which these results are representative for a typical user. Since the default block size is 128MB, providing numbers only for 0.5MB-10MB blocks seems unhelpful. If the recommendation is to use a 1MB or 4MB block size, that is a problem, because the default page size in Parquet is 1MB; a very small block holding only one or a few pages might be good for bloom filtering, but it is bad for other performance optimizations. On the other hand, 128MB blocks could be bad for bloom filtering, because there is just one row group when you write 100M records. And there are other parameters involved, like the `DEFAULT_MAX_BLOOM_FILTER_BYTES` mentioned by @sunchao . So this is not a comprehensive benchmark.
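
   For reference, here is a minimal sketch of how the knobs mentioned above can be set when writing Parquet from Spark. This is not the benchmark code from this PR; the column name `id`, the output path, and the sizes are placeholders rather than recommendations:

```scala
// Minimal sketch (not the PR's benchmark code): the Parquet writer knobs
// discussed above, passed as DataFrameWriter options. The column name "id",
// the output path, and the sizes are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-bloom-filter-sketch").getOrCreate()

val df = spark.range(100L * 1000 * 1000).toDF("id")  // ~100M records, as in the benchmark

df.write
  .mode("overwrite")
  .option("parquet.block.size", 128L * 1024 * 1024)        // row group size (Parquet default: 128MB)
  .option("parquet.page.size", 1L * 1024 * 1024)           // data page size (Parquet default: 1MB)
  .option("parquet.bloom.filter.enabled#id", "true")       // enable the bloom filter for column "id"
  .option("parquet.bloom.filter.expected.ndv#id", 100000000L)
  .option("parquet.bloom.filter.max.bytes", 1024L * 1024)  // the DEFAULT_MAX_BLOOM_FILTER_BYTES cap
  .parquet("/tmp/parquet_bloom_filter_sketch")
```
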
   But maybe a fully comprehensive benchmark isn't needed here. Still, at a minimum, I would recommend removing the results for the 0.5MB and 1MB block sizes, because row groups that small, holding just one data page, don't make much sense. I would also suggest adding measurements for 16, 64, and 128MB blocks; that might be of more help to users of Parquet bloom filters.
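
   For illustration only, a rough sketch of what sweeping those block sizes could look like; the dataset, paths, and point-lookup predicate are made up for the example and are not taken from the PR's benchmark:

```scala
// Hypothetical sketch of extending the measurements to 16, 64 and 128MB row
// groups. The data, paths, and point-lookup predicate are placeholders for
// illustration; this is not the PR's actual benchmark.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bloom-filter-block-size-sweep").getOrCreate()

val blockSizesMB = Seq(16, 64, 128)

blockSizesMB.foreach { mb =>
  val path = s"/tmp/bloom_bench_${mb}mb"

  spark.range(100L * 1000 * 1000).toDF("id")
    .write
    .mode("overwrite")
    .option("parquet.block.size", mb.toLong * 1024 * 1024)
    .option("parquet.bloom.filter.enabled#id", "true")
    .option("parquet.bloom.filter.expected.ndv#id", 100000000L)
    .parquet(path)

  // A selective point lookup, where row-group skipping via bloom filters should help.
  val start = System.nanoTime()
  val hits = spark.read.parquet(path).where("id = 12345678").count()
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"block size ${mb}%dMB: $hits%d row(s) in $elapsedMs%.1f ms")
}
```
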

