1. The first read is always 65536 bytes, followed by a read of the
size of the Parquet file.

This might be a constant inside adlfs or in the Azure SDK itself (?).
I don't know off the top of my head whether Parquet always reads 64
KiB first or whether that's an Azure SDK thing.

2. It looks like the Parquet footer is read on almost every subsequent call.

It might be a good idea to post a code sample so the meaning of
"subsequent call" becomes clearer. Caching can be problematic because
it's easy to use too much memory on data that doesn't get re-used
and/or becomes outdated compared to the source.

PS: Arrow 16 (the next release) is going to have almost-complete Azure
Data Lake filesystem support built in [1], which might let us tweak
how it interacts with the Parquet reader more deeply.

--
Felipe

[1] https://github.com/apache/arrow/issues/18014 (Python bindings and
URI parsing are still work in progress)

On Tue, Mar 5, 2024 at 2:44 PM Jacek Pliszka <jacek.plis...@gmail.com> wrote:
>
> Hi!
>
> I have noticed 2 things while using pyarrow.dataset.dataset with
> ADLFS and Parquet, and I wonder if this is something worth opening a
> ticket for.
>
> 1. The first read is always 65536 bytes, followed by a read of the
> size of the Parquet file.
> I wonder if there is a way to define the size of the first read and
> have just 1 read.
> I pretty much know how large the footer is in the Parquet files I am
> getting, and I would like to read it in one request.
>
> 2. It looks like the Parquet footer is read on almost every
> subsequent call. Is there a way to cache the Parquet footer so it is
> not read every time?
>
> Thanks in advance for your insights,
>
> Jacek
>
>
