RE: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-11 Thread Luca Maurelli
Yes, the filters are written incorrectly. In production we pre-select the right blobs and then post-filter the rows for finer control. The example was only meant to download many files and test the timings. To date, I have a custom solution directly exploiting the Azure Python SDK

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-04 Thread Weston Pace
> The ability to read a single file in parallel is not going to be important here (each file is very small). However, you will want to make sure it is reading multiple files at once. I would expect that it is doing so but this would be a good thing to verify if you can. > > One quick

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-04 Thread Jacek Pliszka
Mon, 4 Dec 2023 at 14:41 Luca Maurelli wrote: > Thank you @Jacek Pliszka for your feedback! > My answers are below: > > *From:* Jacek Pliszka > *Sent:* Friday, 1 December 2023 17:35 > *To:* user@arrow.apache.org > *Subject:* Re: Usage of Azure filesystem with fsspec and adlfs and > pyarrow

RE: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-04 Thread Luca Maurelli
Thank you @Jacek Pliszka for your feedback! My answers are below: From: Jacek Pliszka Sent: Friday, 1 December 2023 17:35 To: user@arrow.apache.org Subject: Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets)

RE: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-04 Thread Luca Maurelli
Thank you @Weston Pace for your feedback. My answers are below, in red: From: Weston Pace Sent: Friday, 1 December 2023 15:46 To: user@arrow.apache.org Subject: Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets)

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-01 Thread Jacek Pliszka
Hi! These files seem to be below 4 MB, which is the default Azure block size. Possibly they are all read in full. Does someone know if in the approach below the blob is read only once from Azure even if multiple reads are called? Is there a correlation between "my_index" and the filenames you could

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-01 Thread Weston Pace
Those files are quite small. For every single file pyarrow is going to need to read the metadata, determine which columns to read (column filtering), determine if any of the rows need to be read (using row filtering) and then actually issue the read. If you combined all those files into one file

Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-01 Thread Luca Maurelli
I'm new to these libraries, so bear with me; I am learning a lot these days. I started using fsspec and adlfs with the idea of switching between cloud storage and local storage with little effort. I read that adlfs makes use of the Azure Blob Storage Python SDK which supports the use of