Re: [Python][Parquet]Why the default row_group_size is 64M?

2022-02-22 Thread Micah Kornfield
Based on this, aligning the default for the non-dataset write path to 1 million rows seems to make sense in the short term. On Tuesday, February 22, 2022, Shawn Zeng wrote: > Thanks for Weston's clear explanation. Point 2 is what I've experienced > without tuning parameter and point 3 is what I

Re: [Python][Parquet]Why the default row_group_size is 64M?

2022-02-22 Thread Shawn Zeng
Thanks for Weston's clear explanation. Point 2 is what I've experienced without tuning parameters, and point 3 is what I'm concerned about. Looking forward to finer granularity of reading/indexing Parquet than the row group level, which should fix the issue. Weston Pace wrote on Wed, Feb 23, 2022, 13:30: > These

Re: [Python][Parquet]Why the default row_group_size is 64M?

2022-02-22 Thread Weston Pace
These are all great points. A few notes from my own experiments (mostly confirming what others have said): 1) 1M rows is the minimum safe size for row groups on HDD (and perhaps a hair too low in some situations) if you are doing any kind of column selection (i.e. projection pushdown). As that

Re: Dictionaries and multiple record batches

2022-02-22 Thread Chris Nuernberger
If you are going to read all the dictionary blocks prior to reading any record batch anyway, then there is certainly a way to make it work now without changing the file format itself. I think, however, that if what is there currently is working, there is no meaningful advantage gained by adding whatever

Re: [Python][Parquet]Why the default row_group_size is 64M?

2022-02-22 Thread Micah Kornfield
> > What is the reason for this? Do you plan to change the default? I think there is some confusion: I do believe this is the number of rows, but I'd guess it was set to 64M because it wasn't carefully adapted from parquet-mr, which I would guess uses byte size, and therefore it aligns well with

Re: Dictionaries and multiple record batches

2022-02-22 Thread Chris Nuernberger
OK, thanks, I will work with delta dictionaries. How do delta dictionaries solve the random access issue? On Tue, Feb 22, 2022 at 9:51 AM Micah Kornfield wrote: > Dictionary replacement isn't supported in the file format because the > metadata makes it difficult to associate a particular

Dictionaries and multiple record batches

2022-02-22 Thread Micah Kornfield
Dictionary replacement isn't supported in the file format because the metadata makes it difficult to associate a particular dictionary with a record batch for random access. Delta dictionaries are supported, but there was a long-standing bug that prevented their use in Python (

Re: RPATH and Brew on MacOS

2022-02-22 Thread Will Jones
Thanks for your replies. > Is your goal to have libarrow be loaded from a relative path of libparquet? My impression is that I don't have a choice if I installed from Homebrew. That @rpath/libarrow.700.dylib reference looks like it's hard-coded in the libparquet binary. > maybe it is connected

Dictionaries and multiple record batches

2022-02-22 Thread Chris Nuernberger
How are dictionaries intended to be used in a file with multiple record batches? I tried saving record-batch-specific dictionaries and got this error from Python: > pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or dictionary delta in IPC file This seems to defeat the purpose of

Re: [Python][Parquet]Why the default row_group_size is 64M?

2022-02-22 Thread Marnix van den Broek
hi Shawn, I expect this is the default because Parquet comes from the Hadoop ecosystem, and the Hadoop block size is usually set to 64MB. Why would you need a different default? You can set it to the size that fits your use case best, right? Marnix On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng

[Python][Parquet]Why the default row_group_size is 64M?

2022-02-22 Thread Shawn Zeng
Hi, The default row_group_size is really large, which means a table with fewer than 64M rows will not get the benefits of row-group-level statistics. What is the reason for this? Do you plan to change the default? Thanks, Shawn

Re: Looking for suggestions on approach

2022-02-22 Thread Marnix van den Broek
hi KB, Can you be more precise in your question? I understand that you want to get all the different struct types into Arrow dataframe(s) for analytical processing, but I need an idea of how you want to deal with the different types before I can attempt to give an answer that makes sense. One