ggershinsky commented on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-838117882
Thanks @huaxingao. I think my basic question is about the extent to which these results are representative for a typical user. If the default block size is 128MB, providing numbers only for 0.5-10MB blocks is unlikely to be helpful. If the recommendation is to use a 1MB or 4MB block size, that is a problem, because the default page size in Parquet is 1MB; a very small block with only one or a few pages might be good for bloom filtering, but is bad for other performance optimizations. Conversely, 128MB blocks could be bad for bloom filtering, because writing 100M records then produces just one row group. And there are other parameters in play, like the `DEFAULT_MAX_BLOOM_FILTER_BYTES` mentioned by @sunchao. So this is not a comprehensive benchmark. But maybe that's ok.

Still, at a minimum, I would recommend removing the results for the 0.5MB and 1MB block sizes, because row groups that small, with just one data page, don't make much sense. I would also suggest adding measurements for 16, 64 and 128MB; that might be of some help to users of Parquet bloom filters.
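To make the trade-off concrete, here is a small illustrative sketch (not part of the PR's benchmark; the total file size and per-record cost are assumed numbers chosen to match the "100M records, one row group" observation). Parquet writes one bloom filter per column per row group, and closes a row group roughly when buffered data reaches `parquet.block.size`, so the row-group count, and hence bloom-filter granularity, falls as the block size grows:

```python
import math

MB = 1024 * 1024

def row_groups_per_file(total_bytes: int, block_size: int) -> int:
    """Rough approximation: Parquet closes a row group once the
    buffered data reaches parquet.block.size, so a file of
    total_bytes yields about ceil(total_bytes / block_size) groups."""
    return max(1, math.ceil(total_bytes / block_size))

# Assumption: ~128 MB of encoded data for the 100M-record write,
# i.e. only a byte or so per record after encoding/compression.
total = 128 * MB

for block in (1 * MB, 16 * MB, 64 * MB, 128 * MB):
    groups = row_groups_per_file(total, block)
    print(f"block={block // MB:>3} MB -> ~{groups} row group(s), "
          f"so ~{groups} bloom filter(s) per column")
```

Under these assumptions, a 1MB block yields ~128 row groups (fine-grained bloom filters, but each row group holds barely one default-sized page), while a 128MB block collapses everything into a single row group whose bloom filter can prune nothing within the file.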