Based on this, aligning the default for the non-dataset write path to 1
million rows seems to make sense in the short term.
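For reference, a rough sketch of what opting into 1M-row row groups on the
non-dataset write path looks like today, rather than waiting on a default
change (pyarrow assumed; the table here is just a made-up example):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Made-up example table; any pyarrow Table works here.
    n = 5_000_000
    table = pa.table({"id": pa.array(range(n), type=pa.int64())})

    # Cap row groups at 1M rows instead of relying on the current default,
    # so this file ends up with ~5 row groups to prune over.
    pq.write_table(table, "example.parquet", row_group_size=1_000_000)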
On Tuesday, February 22, 2022, Shawn Zeng wrote:
Thanks for Weston's clear explanation. Point 2 is what I've experienced
without tuning parameters, and point 3 is what I'm concerned about. Looking
forward to finer-grained reading/indexing of Parquet than the row group
level, which should fix the issue.
Weston Pace wrote on Wednesday, February 23, 2022 at 13:30:
These are all great points. A few notes from my own experiments
(mostly confirming what others have said):
1) 1M rows is the minimum safe size for row groups on HDD (and
perhaps a hair too low in some situations) if you are doing any kind
of column selection (i.e. projection pushdown). As that
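To make the column selection point concrete, a small sketch (reusing the
hypothetical example.parquet from the sketch above): projection pushdown
just means passing only the columns you need, so each row group read touches
only those column chunks:

    import pyarrow.parquet as pq

    # Only the "id" column chunks are read from each row group; with larger
    # row groups each of these reads becomes a bigger sequential chunk,
    # which is what makes ~1M rows the comfortable floor on HDD.
    subset = pq.read_table("example.parquet", columns=["id"])
    print(subset.num_rows, subset.schema)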
If you are going to read all the dictionary blocks prior to reading any
record batch anyway, there is for sure a way to make it work now without
changing the file format itself. I think, however, that if what is there
currently is working, there is no meaningful advantage gained by adding
whatever
>
> What is the reason for this? Do you plan to change the default?
I think there is some confusion. I do believe this is the number of rows,
but I'd guess it was set to 64M because it was not carefully adapted from
parquet-mr, which I would guess uses byte size, and therefore it aligns well
with
OK, thanks, I will work with delta dictionaries.
How do delta dictionaries solve the random access issue?
On Tue, Feb 22, 2022 at 9:51 AM Micah Kornfield wrote:
Dictionary replacement isn't supported in the file format because the
metadata makes it difficult to associate a particular dictionary with a
record batch for random access.
Delta dictionaries are supported, but there was a long-standing bug that
prevented their use in Python (
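For anyone who wants to try the delta path from Python, a minimal sketch
(assuming a recent pyarrow where IpcWriteOptions exposes
emit_dictionary_deltas and the bug above is fixed; shown with the stream
writer, since per the above the file format accepts deltas but not
replacements):

    import pyarrow as pa

    schema = pa.schema([("color", pa.dictionary(pa.int32(), pa.string()))])

    # The second batch's dictionary extends the first one, so the writer
    # can emit a delta instead of a full replacement.
    dict1 = pa.array(["red", "blue"])
    dict2 = pa.array(["red", "blue", "green"])
    batch1 = pa.record_batch(
        [pa.DictionaryArray.from_arrays(pa.array([0, 1], type=pa.int32()), dict1)],
        schema=schema)
    batch2 = pa.record_batch(
        [pa.DictionaryArray.from_arrays(pa.array([2, 0], type=pa.int32()), dict2)],
        schema=schema)

    sink = pa.BufferOutputStream()
    options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
    with pa.ipc.new_stream(sink, schema, options=options) as writer:
        writer.write_batch(batch1)
        writer.write_batch(batch2)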
Thanks for your replies.
> Is your goal to have libarrow be loaded from a relative path of
libparquet?
My impression is that I don't have a choice if I installed from Homebrew.
That @rpath/libarrow.700.dylib reference looks like it's hard-coded in the
libparquet binary.
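In case it helps debug this, a quick sketch of how I'd confirm that
reference (macOS only; the path is just a guess at a typical Homebrew
layout, adjust it for your machine):

    import subprocess

    # Guessing at a typical Homebrew location; on Apple Silicon it is more
    # likely under /opt/homebrew/opt/apache-arrow/lib.
    libparquet = "/usr/local/opt/apache-arrow/lib/libparquet.dylib"

    # otool -L lists the install names libparquet was linked against; an
    # "@rpath/libarrow.700.dylib" line means the reference is baked into
    # the binary and gets resolved through rpath entries at load time.
    print(subprocess.run(["otool", "-L", libparquet],
                         capture_output=True, text=True).stdout)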
> maybe it is connected
How are dictionaries intended to be used in a file with multiple record
batches?
I tried saving record-batch-specific dictionaries and got this error from
Python:
> pyarrow.lib.ArrowInvalid: Unsupported dictionary replacement or
dictionary delta in IPC file
This seems to defeat the purpose of
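Given the constraint Micah describes, one approach that does work with the
file format is to make every batch reference a single shared dictionary
before writing, e.g. (a sketch, assuming a pyarrow version that provides
Table.unify_dictionaries):

    import pyarrow as pa

    # Two chunks that would otherwise carry different dictionaries, which
    # the file writer rejects as a "replacement".
    t = pa.table({"color": pa.chunked_array([
        pa.array(["red", "blue"]).dictionary_encode(),
        pa.array(["green", "red"]).dictionary_encode(),
    ])})

    # Rewrite every chunk against one shared dictionary so each record
    # batch in the file points at the same dictionary block.
    unified = t.unify_dictionaries()

    with pa.OSFile("colors.arrow", "wb") as f:
        with pa.ipc.new_file(f, unified.schema) as writer:
            for batch in unified.to_batches():
                writer.write_batch(batch)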
hi Shawn,
I expect this is the default because Parquet comes from the Hadoop
ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
need a different default? You can set it to the size that fits your use
case best, right?
Marnix
On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng wrote:
Hi,
The default row_group_size is really large, which means a large table with
fewer than 64M rows will not get the benefits of row-group-level statistics.
What is the reason for this? Do you plan to change the default?
Thanks,
Shawn
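To illustrate what gets lost with one giant row group: the pruning comes
from per-row-group statistics in the footer, which you can inspect directly
(a sketch, reusing the hypothetical example.parquet from earlier in the
thread):

    import pyarrow.parquet as pq

    meta = pq.ParquetFile("example.parquet").metadata
    print("row groups:", meta.num_row_groups)

    # Each row group stores its own min/max per column, which readers use
    # to skip whole row groups; with a single 64M-row group there is only
    # one set of statistics, so nothing can be skipped.
    stats = meta.row_group(0).column(0).statistics
    print(stats.min, stats.max)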
hi KB,
Can you be more precise in your question? I understand that you want to get
all the different struct types in an Arrow dataframe(s) for analytical
processing, but I need an idea of how you want to deal with the different
types before I can attempt to give an answer that makes sense.
One