Re: Python: going from DictionaryType to array of integers and str->int mapping?

2024-04-29 Thread Jacek Pliszka
orrect? > If so, is calling unify_dictionaries() first necessary? > > Also, if the operations only work on chunks, is it up to the user to > iterate through all chunks to create the resulting array of integers? > > Best, > > Laurent > > > On Sun, 28 Apr 2024 at 14:28, Jacek

Re: Python: going from DictionaryType to array of integers and str->int mapping?

2024-04-28 Thread Jacek Pliszka
Hi! table.column('a').chunk(0).dictionary returns the dictionary values as an array that you can map... Then you can construct new dictionary-typed columns from the mapped values and table.column('a').chunk(0).indices using pa.DictionaryArray.from_arrays. BR J On Sun, 28 Apr 2024 at 20:19, Laurent
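
A minimal sketch of the mapping described above, assuming a table with a dictionary-encoded string column 'a' (the column name and data are hypothetical):

    import pyarrow as pa

    # Hypothetical table with a dictionary-encoded string column 'a'
    table = pa.table({'a': pa.array(['x', 'y', 'x', 'z']).dictionary_encode()})

    chunk = table.column('a').chunk(0)   # a pa.DictionaryArray
    values = chunk.dictionary            # the distinct strings
    indices = chunk.indices              # integer codes pointing into values

    # str -> int mapping for this chunk
    mapping = {v.as_py(): i for i, v in enumerate(values)}

    # Rebuild a dictionary-typed array from (possibly remapped) values and indices
    rebuilt = pa.DictionaryArray.from_arrays(indices, values)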

Re: rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
the rows in those files, or the >>> original order when the input is in memory) that will be respected if there >>> are no joins or aggregates. >>> >>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote: >>> >>>> I think that ordering is only

rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
mentioned in the docs. I am using 12.0.1 due to a Python 3.7 dependency. Best Regards, Jacek Pliszka
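
Not from the thread, but one common way to keep a deterministic order is to carry an explicit row index through the join and sort on it afterwards; a hedged sketch with made-up tables:

    import pyarrow as pa

    left = pa.table({'k': [3, 1, 2], 'v': ['c', 'a', 'b']})
    right = pa.table({'k': [1, 2, 3], 'w': [10, 20, 30]})

    # Record the original row order explicitly, since join output order
    # is not guaranteed.
    left = left.append_column('_row', pa.array(list(range(left.num_rows)), pa.int64()))

    joined = left.join(right, keys='k')
    restored = joined.sort_by('_row').drop_columns(['_row'])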

Re: Fine tuning pyarrow.dataset.dataset with adlfs

2024-03-07 Thread Jacek Pliszka
PS: Arrow 16 (the next release) is going to have almost-complete Azure > > Data Lake FS support built-in [1] which might allow us to tweak the > > way it interacts with Parquet reader more deeply. > > > > -- > > Felipe > > > > [1] https://github.com/apa

Fine tuning pyarrow.dataset.dataset with adlfs

2024-03-05 Thread Jacek Pliszka
Hi! I have noticed 2 things while using pyarrow.dataset.dataset with ADLFS and parquet, and I wonder if this is something worth opening a ticket for. 1. The first read is always 65536 bytes, then it is followed by a read of the size of the parquet. I wonder if there is a way to have the size of the first
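
For reference, a hedged sketch of the setup being discussed (account, container and column names are assumptions):

    import pyarrow.dataset as ds
    import adlfs  # assumes adlfs is installed and credentials are configured

    fs = adlfs.AzureBlobFileSystem(account_name='myaccount')  # hypothetical account
    dataset = ds.dataset('mycontainer/data/', format='parquet', filesystem=fs)
    table = dataset.to_table(columns=['a', 'b'])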

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-04 Thread Jacek Pliszka
On Mon, 4 Dec 2023 at 14:41, Luca Maurelli wrote: > Thank you @Jacek Pliszka for your feedback! > Below are my answers: > > > > *From:* Jacek Pliszka > *Sent:* Friday, 1 December 2023 17:35 > *To:* user@arrow.apache.org > *Subject:* Re: Usage of Azure filesystem with fss

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-01 Thread Jacek Pliszka
Hi! These files seem to be below 4MB, which is the default Azure block size. Possibly they are all read in full. Does someone know if, in the approach below, the blob is read only once from Azure even if multiple reads are called? Is there a correlation between "my_index" and the filenames you could
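
A sketch of the approach under discussion - reading a list of parquet blobs through adlfs with column pruning and row filtering in one scan (paths and column names are hypothetical):

    import pyarrow.dataset as ds
    import adlfs  # assumption: adlfs configured for the storage account

    fs = adlfs.AzureBlobFileSystem(account_name='myaccount')
    paths = ['container/blob1.parquet', 'container/blob2.parquet']

    dataset = ds.dataset(paths, format='parquet', filesystem=fs)
    # Column pruning and row filtering happen inside the scan; fragments are
    # read concurrently by Arrow's thread pool.
    table = dataset.to_table(columns=['my_index', 'value'],
                             filter=ds.field('value') > 0)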

Re: [Python/C++] pyarrow.parquet.read_table stopped working in version 12.0.0

2023-11-16 Thread Jacek Pliszka
tputStream() pq.write_table(t, f) b=f.getvalue() ds = pq.ParquetDataset(b, filters=[['d', '=', 1]]) ds.read() # fails funnily: ['d', '=', np.int16(1000)] works while ['d', '=', np.int16(1)] fails Best Regards, Jacek On Thu, 16 Nov 2023 at 13:42, Jacek Pliszka wrote: > > Hi! > > I foun

[Python/C++] pyarrow.parquet.read_table stopped working in version 12.0.0

2023-11-16 Thread Jacek Pliszka
rs=[['dec', '==', pc.cast(8024, pa.decimal128(38,10))]]) Does someone know what happened? It looks kind of strange that it works for np.int16 and decimal but not int64. And 29 seems confusing as 2**64<10**20 and 38-10=28 > 20 Thanks for any help, Jacek Pliszka
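
A sketch mirroring the repro above for the case that was reported to work (a decimal column plus an explicitly cast literal); whether it succeeds depends on the pyarrow version:

    from decimal import Decimal

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Hypothetical table with a decimal column named 'dec'
    t = pa.table({'dec': pa.array([Decimal('8024'), Decimal('1')], pa.decimal128(38, 10))})
    f = pa.BufferOutputStream()
    pq.write_table(t, f)
    b = f.getvalue()

    ds_ = pq.ParquetDataset(b, filters=[('dec', '==', pc.cast(8024, pa.decimal128(38, 10)))])
    table = ds_.read()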

A bit surprising results

2023-11-08 Thread Jacek Pliszka
Hi! I got surprising results when comparing numpy and pyarrow performance. val = np.uint8(115) numpy has similar speed whether using 115 or np.uint8(115): %timeit np.count_nonzero(data_np == val) 591 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) %timeit
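
A self-contained version of the kind of comparison being made (the data here is made up):

    import numpy as np
    import pyarrow as pa
    import pyarrow.compute as pc

    data_np = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)
    data_pa = pa.array(data_np)
    val = np.uint8(115)

    n_np = np.count_nonzero(data_np == val)   # numpy side
    mask = pc.equal(data_pa, val)             # pyarrow side
    n_pa = pc.sum(mask.cast(pa.uint8())).as_py()
    assert n_np == n_pa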

Any efficient way of partitioning tables in memory?

2023-09-29 Thread Jacek Pliszka
Hi! I am looking for an efficient way of working with pieces of a Table. Let's say I have a table with 100M rows and a key column with 100 distinct values. I would like to be able to quickly get a subtable with just the rows for a given key value. Currently I run filter 100 times to generate
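
One possible approach (not from the thread): sort once by the key and then slice contiguous runs instead of filtering repeatedly; a sketch with toy data:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({'key': ['b', 'a', 'b', 'c', 'a'], 'v': [1, 2, 3, 4, 5]})

    sorted_table = table.sort_by('key')
    parts = {}
    offset = 0
    # Assumes value_counts returns keys in first-appearance order, which for a
    # sorted column matches the contiguous runs.
    for item in pc.value_counts(sorted_table.column('key')):
        key, count = item['values'].as_py(), item['counts'].as_py()
        parts[key] = sorted_table.slice(offset, count)
        offset += count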

Re: [python] Diffing 2 large tables

2023-07-08 Thread Jacek Pliszka
to re-write the extracts from Postgres. Is there an > easy way to partition the results of a SQL query or would I need to write > something? > > Many thanks > > Adrian > > On Fri, 7 Jul 2023 at 12:53, Jacek Pliszka wrote: >> >> Hi! >> >> If you ha

Re: [python] Diffing 2 large tables

2023-07-07 Thread Jacek Pliszka
Hi! If you have any influence over how data is dumped from Postgres - my suggestion is to have it already partitioned at that stage. This would make parallelization much easier. BR Jacek On Fri, 7 Jul 2023 at 12:21, Adrian Mowat wrote: > > Hi, > > TL;DR: newbie question. I have a pyarrow program
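
If rewriting the extracts is an option, one way (a sketch, not from the thread) is to write them as a partitioned dataset so both sides of the diff can be compared partition by partition; names below are made up:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Hypothetical extract; in practice this comes from the Postgres dump
    table = pa.table({'id': [1, 2, 3, 4],
                      'region': ['eu', 'us', 'eu', 'us'],
                      'v': [1.0, 2.0, 3.0, 4.0]})

    ds.write_dataset(table, 'extract_a', format='parquet',
                     partitioning=ds.partitioning(pa.schema([('region', pa.string())]),
                                                  flavor='hive'))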

Re: [python] Using Arrow for storing compressable python dictionaries

2022-11-23 Thread Jacek Pliszka
; good file format for this problem. > It naturally supports slicing like fp['field1'][1000:5000], provides chunking > and compression, new arrays can be appended... Maybe Arrow is just not the > right tool for this specific problem. > > Kind regards, > > Ramon. >

Re: [python] Using Arrow for storing compressable python dictionaries

2022-11-23 Thread Jacek Pliszka
Hi! I am not sure if this would solve your problem: pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]]) for f, v in x.items()]) pyarrow.Table v: double f: string v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]] f:
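
A runnable version of the snippet above, with a hypothetical input dict x:

    import pyarrow as pa

    x = {
        'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
        'field2': [0.3, 0.5, 0.1],
        'field3': [0.9, float('nan'), float('nan'), 0.1, 0.5],
    }

    table = pa.concat_tables([
        pa.Table.from_pydict({'v': v}).append_column('f', [len(v) * [f]])
        for f, v in x.items()
    ])
    # Result has columns v: double and f: string; 'f' could be
    # dictionary-encoded before writing to compress well.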

Re: String Array Concatenation function?

2022-09-27 Thread Jacek Pliszka
Hi! I think the API section is more user-friendly: https://arrow.apache.org/docs/python/api/compute.html#api-compute https://arrow.apache.org/docs/python/generated/pyarrow.compute.binary_join_element_wise.html#pyarrow.compute.binary_join_element_wise BR J On Mon, 26 Sep 2022 at 23:48, Ian Cook
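
For example, per the second link, the separator goes last (a small illustrative sketch):

    import pyarrow as pa
    import pyarrow.compute as pc

    a = pa.array(['foo', 'bar'])
    b = pa.array(['1', '2'])

    # Element-wise concatenation; the final positional argument is the separator
    joined = pc.binary_join_element_wise(a, b, '-')
    # -> ["foo-1", "bar-2"]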

Re: [Python] - Dataset API - What's happening under the hood?

2022-09-19 Thread Jacek Pliszka
Re 2. In the Python Azure SDK there is logic for partial blob reads: https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob However I was unable to use it as it does not support parquet files with

Re: [c++][compute]Is there any other way to use Join besides Acero?

2022-09-15 Thread Jacek Pliszka
Hi! Why don't you use the arrow Table join directly? https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join Though you need to be careful with join order, as speed may differ depending on the order of the joined tables. BR, Jacek On Thu, 15 Sep 2022 at 06:15, Weston
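
A small illustrative sketch of the Python-level call referenced above (tables are made up):

    import pyarrow as pa

    orders = pa.table({'id': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})
    customers = pa.table({'id': [1, 2], 'name': ['a', 'b']})

    # Default join_type is 'left outer'; swapping which table is on the left
    # can change performance, as noted above.
    result = orders.join(customers, keys='id')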

Re: why that take so many times to read parquets file with 300 000 columns

2021-03-01 Thread Jacek Pliszka
Should be - if you need a cast... t.column(i).cast(..) uses the arrow cast.. BR, Jacek On Mon, 1 Mar 2021 at 17:04, Jacek Pliszka wrote: > > Use np.column_stack and a list comprehension: > > t = pq.read_table('a.pq') > matrix = np.column_stack([t.column(i) for i in range(t.num_column
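
The quoted approach, completed into a runnable sketch (assumes 'a.pq' holds numeric columns that NumPy can stack into one dtype):

    import numpy as np
    import pyarrow.parquet as pq

    t = pq.read_table('a.pq')
    matrix = np.column_stack([t.column(i) for i in range(t.num_columns)])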

Re: why that take so many times to read parquets file with 300 000 columns

2021-03-01 Thread Jacek Pliszka
((data.num_rows,data.num_columns),dtype=np.bool_) > for i,col in enumerate(data.columns): > matrix[:,i] = col > > > > > On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote: > > Others will probably give you better hints but > > > > You do not need to conver

Re: Can I load from a parquet file only few columns ?

2021-02-12 Thread Jacek Pliszka
> Anyway > > Have a good day > > On Friday, 12 February 2021 at 15:26 +0100, Jacek Pliszka wrote: > > Sure - I believe you can do it even in pandas - you have a columns > > parameter: pd.read_parquet('f.pq', columns=['A', 'B']) > > > > arrow is more us

Re: Can I load from a parquet file only few columns ?

2021-02-12 Thread Jacek Pliszka
Sure - I believe you can do it even in pandas - you have a columns parameter: pd.read_parquet('f.pq', columns=['A', 'B']) arrow is more useful if you need to do some conversion or filtering. BR, Jacek On Fri, 12 Feb 2021 at 15:21, jonathan mercier wrote: > > Dear, > I have parquet files with
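
Both variants mentioned above, side by side (file and column names come from the message):

    import pandas as pd
    import pyarrow.parquet as pq

    # pandas: reads only columns A and B
    df = pd.read_parquet('f.pq', columns=['A', 'B'])

    # pyarrow: same column pruning, returns a Table
    table = pq.read_table('f.pq', columns=['A', 'B'])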

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Jacek Pliszka
I believe it would be good if you defined your use case. I do handle larger-than-memory datasets with pyarrow with the use of dataset.scan, but my use case is very specific as I am repartitioning and cleaning somewhat large datasets. BR, Jacek On Thu, 22 Oct 2020 at 20:39, Jacob Zelko wrote:
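
The scan API mentioned above has evolved; in recent pyarrow a batch-wise pass over a larger-than-memory dataset looks roughly like this (path and column names are hypothetical):

    import pyarrow.dataset as ds

    dataset = ds.dataset('data/', format='parquet')
    for batch in dataset.to_batches(columns=['key', 'value']):
        # process one RecordBatch at a time, never materializing the full table
        ...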