Re: Python: going from DictionaryType to array of integers and str->int mapping?

2024-04-29 Thread Jacek Pliszka
orrect? > If so, is calling unify_dictionaries() first necessary? > > Also, if the operations only work on chunks, is it up to the user to > iterate through all chunks to create the resulting array of integers? > > Best, > > Laurent > > > On Sun, 28 Apr 2024 at 14:28, Jacek

Re: Python: going from DictionaryType to array of integers and str->int mapping?

2024-04-28 Thread Jacek Pliszka
Hi! table.column('a').chunk(0).dictionary returns the dictionary values as an array that you can map... Then you can construct new dictionary-typed columns from the mapped values and table.column('a').chunk(0).indices using pa.DictionaryArray.from_arrays. BR J On Sun, 28 Apr 2024 at 20:19, Laurent
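
A minimal sketch of the mapping described above, assuming a table with a dictionary-encoded string column 'a' (the column name and data are hypothetical):

    import pyarrow as pa

    # Hypothetical table with a dictionary-encoded string column 'a'
    table = pa.table({'a': pa.array(['x', 'y', 'x', 'z']).dictionary_encode()})

    chunk = table.column('a').chunk(0)   # a pa.DictionaryArray
    values = chunk.dictionary            # the distinct strings
    indices = chunk.indices              # integer codes pointing into values

    # str -> int mapping for this chunk
    mapping = {v.as_py(): i for i, v in enumerate(values)}

    # Rebuild a dictionary-typed array from (possibly remapped) values and indices
    rebuilt = pa.DictionaryArray.from_arrays(indices, values)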

Re: rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
the rows in those files, or the >>> original order when the input is in memory) that will be respected if there >>> are no joins or aggregates. >>> >>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote: >>> >>>> I think that ordering is only

rows reshuffled on join

2024-04-16 Thread Jacek Pliszka
mentioned in the docs. I am using 12.0.1 due to a Python 3.7 dependency. Best Regards, Jacek Pliszka
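
Not from the thread, but one common way to keep a deterministic order is to carry an explicit row index through the join and sort on it afterwards; a hedged sketch with made-up tables:

    import pyarrow as pa

    left = pa.table({'k': [3, 1, 2], 'v': ['c', 'a', 'b']})
    right = pa.table({'k': [1, 2, 3], 'w': [10, 20, 30]})

    # Record the original row order explicitly, since join output order
    # is not guaranteed.
    left = left.append_column('_row', pa.array(list(range(left.num_rows)), pa.int64()))

    joined = left.join(right, keys='k')
    restored = joined.sort_by('_row').drop_columns(['_row'])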

Re: Fine tuning pyarrow.dataset.dataset with adlfs

2024-03-07 Thread Jacek Pliszka
PS: Arrow 16 (the next release) is going to have almost-complete Azure > > Data Lake FS support built-in [1] which might allow us to tweak the > > way it interacts with Parquet reader more deeply. > > > > -- > > Felipe > > > > [1] https://github.com/apa

Fine tuning pyarrow.dataset.dataset with adlfs

2024-03-05 Thread Jacek Pliszka
Hi! I have noticed 2 things while using pyarrow.dataset.dataset with ADLFS and parquet, and I wonder if this is something worth opening a ticket for. 1. The first read is always 65536 bytes, then it is followed by a read of the size of the parquet. I wonder if there is a way to have the size of the first
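
For reference, a hedged sketch of the setup being discussed (account, container and column names are assumptions):

    import pyarrow.dataset as ds
    import adlfs  # assumes adlfs is installed and credentials are configured

    fs = adlfs.AzureBlobFileSystem(account_name='myaccount')  # hypothetical account
    dataset = ds.dataset('mycontainer/data/', format='parquet', filesystem=fs)
    table = dataset.to_table(columns=['a', 'b'])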

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-04 Thread Jacek Pliszka
On Mon, 4 Dec 2023 at 14:41, Luca Maurelli wrote: > Thank you @Jacek Pliszka for your feedback! > Below are my answers: > > > > *From:* Jacek Pliszka > *Sent:* Friday, 1 December 2023 17:35 > *To:* user@arrow.apache.org > *Subject:* Re: Usage of Azure filesystem with fss

Re: Usage of Azure filesystem with fsspec and adlfs and pyarrow to download a list of blobs (parquets) concurrently with columns pruning and rows filtering

2023-12-01 Thread Jacek Pliszka
Hi! These files seem to be below 4MB, which is the default Azure block size. Possibly they are all read in full. Does someone know if, in the approach below, the blob is read only once from Azure even if multiple reads are called? Is there a correlation between "my_index" and the filenames you could
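
A sketch of the approach under discussion - reading a list of parquet blobs through adlfs with column pruning and row filtering in one scan (paths and column names are hypothetical):

    import pyarrow.dataset as ds
    import adlfs  # assumption: adlfs configured for the storage account

    fs = adlfs.AzureBlobFileSystem(account_name='myaccount')
    paths = ['container/blob1.parquet', 'container/blob2.parquet']

    dataset = ds.dataset(paths, format='parquet', filesystem=fs)
    # Column pruning and row filtering happen inside the scan; fragments are
    # read concurrently by Arrow's thread pool.
    table = dataset.to_table(columns=['my_index', 'value'],
                             filter=ds.field('value') > 0)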

Re: [Python/C++] pyarrow.parquet.read_table stopped working in version 12.0.0

2023-11-16 Thread Jacek Pliszka
tputStream() pq.write_table(t, f) b=f.getvalue() ds = pq.ParquetDataset(b, filters=[['d', '=', 1]]) ds.read() # fails funnily: ['d', '=', np.int16(1000)] works while ['d', '=', np.int16(1)] fails Best Regards, Jacek On Thu, 16 Nov 2023 at 13:42, Jacek Pliszka wrote: > > Hi! > > I foun

[Python/C++] pyarrow.parquet.read_table stopped working in version 12.0.0

2023-11-16 Thread Jacek Pliszka
rs=[['dec', '==', pc.cast(8024, pa.decimal128(38,10))]]) Does someone know what happened? It looks kind of strange that it works for np.int16 and decimal but not int64. And 29 seems confusing as 2**64<10**20 and 38-10=28 > 20 Thanks for any help, Jacek Pliszka
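
A sketch mirroring the repro above for the case that was reported to work (a decimal column plus an explicitly cast literal); whether it succeeds depends on the pyarrow version:

    from decimal import Decimal

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Hypothetical table with a decimal column named 'dec'
    t = pa.table({'dec': pa.array([Decimal('8024'), Decimal('1')], pa.decimal128(38, 10))})
    f = pa.BufferOutputStream()
    pq.write_table(t, f)
    b = f.getvalue()

    ds_ = pq.ParquetDataset(b, filters=[('dec', '==', pc.cast(8024, pa.decimal128(38, 10)))])
    table = ds_.read()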

A bit surprising results

2023-11-08 Thread Jacek Pliszka
Hi! I got surprising results when comparing numpy and pyarrow performance. val = np.uint8(115) numpy has similar speed whether using 115 or np.uint8(115): %timeit np.count_nonzero(data_np == val) 591 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each) %timeit
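
A self-contained version of the kind of comparison being made (the data here is made up):

    import numpy as np
    import pyarrow as pa
    import pyarrow.compute as pc

    data_np = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)
    data_pa = pa.array(data_np)
    val = np.uint8(115)

    n_np = np.count_nonzero(data_np == val)   # numpy side
    mask = pc.equal(data_pa, val)             # pyarrow side
    n_pa = pc.sum(mask.cast(pa.uint8())).as_py()
    assert n_np == n_pa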

Any efficient way of partitioning tables in memory?

2023-09-29 Thread Jacek Pliszka
Hi! I am looking for an efficient way of working with pieces of a Table. Let's say I have a table with 100M rows and a key column with 100 distinct values. I would like to be able to quickly get a subtable with just the rows for a given key value. Currently I run filter 100 times to generate
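
One possible approach (not from the thread): sort once by the key and then slice contiguous runs instead of filtering repeatedly; a sketch with toy data:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({'key': ['b', 'a', 'b', 'c', 'a'], 'v': [1, 2, 3, 4, 5]})

    sorted_table = table.sort_by('key')
    parts = {}
    offset = 0
    # Assumes value_counts returns keys in first-appearance order, which for a
    # sorted column matches the contiguous runs.
    for item in pc.value_counts(sorted_table.column('key')):
        key, count = item['values'].as_py(), item['counts'].as_py()
        parts[key] = sorted_table.slice(offset, count)
        offset += count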

Re: [python] Diffing 2 large tables

2023-07-08 Thread Jacek Pliszka
to re-write the extracts from Postgres. Is there an > easy way to partition the results of a SQL query or would I need to write > something? > > Many thanks > > Adrian > > On Fri, 7 Jul 2023 at 12:53, Jacek Pliszka wrote: >> >> Hi! >> >> If you ha

Re: [python] Diffing 2 large tables

2023-07-07 Thread Jacek Pliszka
Hi! If you have any influence over how data is dumped from Postgres - my suggestion is to have it already partitioned at that stage. This would make parallelization much easier. BR Jacek On Fri, 7 Jul 2023 at 12:21, Adrian Mowat wrote: > > Hi, > > TL;DR: newbie question. I have a pyarrow program
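
If rewriting the extracts is an option, one way (a sketch, not from the thread) is to write them as a partitioned dataset so both sides of the diff can be compared partition by partition; names below are made up:

    import pyarrow as pa
    import pyarrow.dataset as ds

    # Hypothetical extract; in practice this comes from the Postgres dump
    table = pa.table({'id': [1, 2, 3, 4],
                      'region': ['eu', 'us', 'eu', 'us'],
                      'v': [1.0, 2.0, 3.0, 4.0]})

    ds.write_dataset(table, 'extract_a', format='parquet',
                     partitioning=ds.partitioning(pa.schema([('region', pa.string())]),
                                                  flavor='hive'))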

Re: [python] Using Arrow for storing compressable python dictionaries

2022-11-23 Thread Jacek Pliszka
; good file format for this problem. > It naturally supports slicing like fp['field1'][1000:5000], provides chunking > and compression, new arrays can be appended... Maybe Arrow is just not the > right tool for this specific problem. > > Kind regards, > > Ramon. >

Re: [python] Using Arrow for storing compressable python dictionaries

2022-11-23 Thread Jacek Pliszka
Hi! I am not sure if this would solve your problem: pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]]) for f, v in x.items()]) pyarrow.Table v: double f: string v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]] f:
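
A runnable version of the snippet above, with a hypothetical input dict x:

    import pyarrow as pa

    x = {
        'field1': [0.2, 0.2, 0.2, 0.1, 0.2, 0.0, 0.8, 0.7],
        'field2': [0.3, 0.5, 0.1],
        'field3': [0.9, float('nan'), float('nan'), 0.1, 0.5],
    }

    table = pa.concat_tables([
        pa.Table.from_pydict({'v': v}).append_column('f', [len(v) * [f]])
        for f, v in x.items()
    ])
    # Result has columns v: double and f: string; 'f' could be
    # dictionary-encoded before writing to compress well.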

Re: String Array Concatenation function?

2022-09-27 Thread Jacek Pliszka
Hi! I think the API section is more user-friendly: https://arrow.apache.org/docs/python/api/compute.html#api-compute https://arrow.apache.org/docs/python/generated/pyarrow.compute.binary_join_element_wise.html#pyarrow.compute.binary_join_element_wise BR J On Mon, 26 Sep 2022 at 23:48, Ian Cook
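
For example, per the second link, the separator goes last (a small illustrative sketch):

    import pyarrow as pa
    import pyarrow.compute as pc

    a = pa.array(['foo', 'bar'])
    b = pa.array(['1', '2'])

    # Element-wise concatenation; the final positional argument is the separator
    joined = pc.binary_join_element_wise(a, b, '-')
    # -> ["foo-1", "bar-2"]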

Re: [Python] - Dataset API - What's happening under the hood?

2022-09-19 Thread Jacek Pliszka
Re 2. In the Python Azure SDK there is logic for partial blob reads: https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob However I was unable to use it as it does not support parquet files with

Re: [c++][compute]Is there any other way to use Join besides Acero?

2022-09-15 Thread Jacek Pliszka
Hi! Why don't you use the arrow Table join directly? https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join Though you need to be careful with join order, as speed may differ depending on the order of the joined tables. BR, Jacek On Thu, 15 Sep 2022 at 06:15, Weston
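
A small illustrative sketch of the Python-level call referenced above (tables are made up):

    import pyarrow as pa

    orders = pa.table({'id': [1, 2, 3], 'amount': [10.0, 20.0, 30.0]})
    customers = pa.table({'id': [1, 2], 'name': ['a', 'b']})

    # Default join_type is 'left outer'; swapping which table is on the left
    # can change performance, as noted above.
    result = orders.join(customers, keys='id')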

Re: why that take so many times to read parquets file with 300 000 columns

2021-03-01 Thread Jacek Pliszka
Should be - if you need a cast... t.column(i).cast(..) uses the arrow cast.. BR, Jacek On Mon, 1 Mar 2021 at 17:04, Jacek Pliszka wrote: > > Use np.column_stack and a list comprehension: > > t = pq.read_table('a.pq') > matrix = np.column_stack([t.column(i) for i in range(t.num_column
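
The quoted approach, completed into a runnable sketch (assumes 'a.pq' holds numeric columns that NumPy can stack into one dtype):

    import numpy as np
    import pyarrow.parquet as pq

    t = pq.read_table('a.pq')
    matrix = np.column_stack([t.column(i) for i in range(t.num_columns)])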

Re: why that take so many times to read parquets file with 300 000 columns

2021-03-01 Thread Jacek Pliszka
((data.num_rows,data.num_columns),dtype=np.bool_) > for i,col in enumerate(data.columns): > matrix[:,i] = col > > > > > On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote: > > Others will probably give you better hints but > > > > You do not need to conver

Re: Can I load from a parquet file only few columns ?

2021-02-12 Thread Jacek Pliszka
> Anyway > > Have a good day > > On Friday, 12 February 2021 at 15:26 +0100, Jacek Pliszka wrote: > > Sure - I believe you can do it even in pandas - you have a columns > > parameter: pd.read_parquet('f.pq', columns=['A', 'B']) > > > > arrow is more us

Re: Can I load from a parquet file only few columns ?

2021-02-12 Thread Jacek Pliszka
Sure - I believe you can do it even in pandas - you have a columns parameter: pd.read_parquet('f.pq', columns=['A', 'B']) arrow is more useful if you need to do some conversion or filtering. BR, Jacek On Fri, 12 Feb 2021 at 15:21, jonathan mercier wrote: > > Dear, > I have parquet files with
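
Both variants mentioned above, side by side (file and column names come from the message):

    import pandas as pd
    import pyarrow.parquet as pq

    # pandas: reads only columns A and B
    df = pd.read_parquet('f.pq', columns=['A', 'B'])

    # pyarrow: same column pruning, returns a Table
    table = pq.read_table('f.pq', columns=['A', 'B'])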

Re: Does Arrow Support Larger-than-Memory Handling?

2020-10-22 Thread Jacek Pliszka
I believe it would be good if you defined your use case. I do handle larger-than-memory datasets with pyarrow with the use of dataset.scan, but my use case is very specific as I am repartitioning and cleaning somewhat large datasets. BR, Jacek On Thu, 22 Oct 2020 at 20:39, Jacob Zelko wrote:
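
The scan API mentioned above has evolved; in recent pyarrow a batch-wise pass over a larger-than-memory dataset looks roughly like this (path and column names are hypothetical):

    import pyarrow.dataset as ds

    dataset = ds.dataset('data/', format='parquet')
    for batch in dataset.to_batches(columns=['key', 'value']):
        # process one RecordBatch at a time, never materializing the full table
        ...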