GitHub user nlgranger edited a discussion: How to find the best write options for reads of file bytes?
# TLDR

I am struggling to find an optimal configuration for the `ParquetWriter` when the final goal is to read random rows from the resulting dataset. The data is:

- one column of file names as strings
- one column containing the bytes of an image file, so up to a few hundred kiB.

# What I tried

I noticed that datasets distributed as parquet files are typically ill-suited for fast random row reads, notably a few I have tested from Hugging Face. Going through the options of the `ParquetWriter`, here are some of the options that might need adjusting:

- **Group size:** for random access there is no need to make it large, but making it too small can also hurt. I assume a very small row group size slows down locating a row in the list of row group statistics.
- **Page size:** since the pyarrow reader does not support the page-level index, what is the point of having multiple pages per row group?
- **Compression:** this is black magic: sometimes enabling it helps, sometimes it hurts performance. It seems better to disable it and store already-compressed data in the row. Also, should the columns used for filtering never be compressed?
- **Sorting columns:** does sorting have any effect on performance in practice?
- **Bloom filters:** are they supported in pyarrow?

(For concreteness, rough sketches of a possible writer and reader setup are included below, after the references.)

**Could you share some recommendations or guidelines for optimizing random row reads?**

*(Also, why are Dataset.take() and Table.take() so damn slow?)*

# References:

- https://github.com/waymo-research/waymo-open-dataset/issues/856
- https://huggingface.co/docs/hub/en/datasets-streaming#efficient-random-access

GitHub link: https://github.com/apache/arrow/discussions/48940
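On the write side, here is a minimal sketch of how the knobs above map onto pyarrow's `ParquetWriter` and `write_table` arguments. The column names (`file_name`, `image_bytes`), the output file name, and the specific sizes are placeholders rather than recommendations:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder schema: a small string column and a binary column holding
# already-compressed image bytes (up to a few hundred kiB per value).
schema = pa.schema([
    ("file_name", pa.string()),
    ("image_bytes", pa.binary()),
])

writer = pq.ParquetWriter(
    "images.parquet",
    schema,
    # Per-column codecs: skip recompressing the image bytes, keep a cheap
    # codec on the small string column.
    compression={"image_bytes": "none", "file_name": "zstd"},
    # Large data page target so a column chunk is not split into many pages
    # when the reader does not exploit the page index anyway.
    data_page_size=64 * 1024 * 1024,
    write_statistics=True,
)

table = pa.table(
    {
        "file_name": ["a.jpg", "b.jpg"],
        "image_bytes": [b"...jpeg bytes...", b"...jpeg bytes..."],
    },
    schema=schema,
)

# Modest row group size: a single-row read only has to decode this many
# rows, while the footer metadata stays small enough to search quickly.
writer.write_table(table, row_group_size=128)
writer.close()
```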

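And on the read side, a sketch of fetching a single row by decoding only the row group that contains it, using nothing but the footer metadata; this is the access pattern that the row group size trades off against (`read_row` is a hypothetical helper, not a pyarrow API):

```python
import bisect

import pyarrow.parquet as pq


def read_row(path, row_index):
    """Fetch one row by decoding only the row group that contains it."""
    pf = pq.ParquetFile(path)

    # Build cumulative row counts per row group from the footer metadata,
    # so a global row index maps to (row group, offset inside the group).
    starts, total = [], 0
    for i in range(pf.metadata.num_row_groups):
        starts.append(total)
        total += pf.metadata.row_group(i).num_rows

    group = bisect.bisect_right(starts, row_index) - 1
    offset = row_index - starts[group]

    # Decode the one row group and slice out the requested row.
    return pf.read_row_group(group).slice(offset, 1)


# Hypothetical usage, matching the file written above.
row = read_row("images.parquet", 1)
print(row["file_name"][0])
```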