correct?
> If so, is calling unify_dictionaries() first necessary?
>
> Also, if the operations only work on chunks, is it up to the user to
> iterate through all chunks to create the resulting array of integers?
>
> Best,
>
> Laurent
>
>
On Sun, 28 Apr 2024 at 14:28, Jacek
Hi!
table.column('a').chunk(0).dictionary returns the dictionary values as an array
that you can map...
Then you can construct new dictionary-typed columns from the mapped values
and table.column('a').chunk(0).indices
using pa.DictionaryArray.from_arrays
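For illustration, a minimal sketch along those lines (the table, the column
name 'a' and the upper-casing of the dictionary values are only assumptions
for the example):

import pyarrow as pa

chunk = table.column('a').chunk(0)
# map the dictionary values of this chunk (upper-casing is just an example)
mapped = pa.array([v.upper() for v in chunk.dictionary.to_pylist()])
# rebuild a dictionary-encoded array from the original indices and the mapped values
new_chunk = pa.DictionaryArray.from_arrays(chunk.indices, mapped)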
BR
J
On Sun, 28 Apr 2024 at 20:19, Laurent
the rows in those files, or the
>>> original order when the input is in memory) that will be respected if there
>>> are no joins or aggregates.
>>>
>>> On Tue, Apr 16, 2024 at 8:19 AM Aldrin wrote:
>>>
>>>> I think that ordering is only
mentioned in the docs.
I am using 12.0.1 due to Python 3.7 dependency.
Best Regards,
Jacek Pliszka
PS: Arrow 16 (the next release) is going to have almost-complete Azure
> > Data Lake FS support built-in [1] which might allow us to tweak the
> > way it interacts with Parquet reader more deeply.
> >
> > --
> > Felipe
> >
> > [1] https://github.com/apa
Hi!
I have noticed two things while using
pyarrow.dataset.dataset with ADLFS and parquet, and I wonder if this is
something worth opening a ticket for.
1. The first read is always 65536 bytes, then it is followed by a read of the
size of the parquet.
I wonder if there is a way to have the size of the first
On Mon, 4 Dec 2023 at 14:41, Luca Maurelli wrote:
> Thank you @Jacek Pliszka for your feedback!
> Below my answers:
>
>
>
> *From:* Jacek Pliszka
> *Sent:* Friday, 1 December 2023 17:35
> *To:* user@arrow.apache.org
> *Subject:* Re: Usage of Azure filesystem with fss
Hi!
These files seem to be below 4 MB, which is the default Azure block size. Possibly
they are all read in full. Does someone know whether, in the approach below, the
blob is read only once from Azure even if multiple reads are called?
Is there a correlation between "my_index" and the filenames you could
f = pa.BufferOutputStream()
pq.write_table(t, f)
b=f.getvalue()
ds = pq.ParquetDataset(b, filters=[['d', '=', 1]])
ds.read() # fails
funnily:
['d', '=', np.int16(1000)] works while
['d', '=', np.int16(1)] fails
Best Regards,
Jacek
On Thu, 16 Nov 2023 at 13:42, Jacek Pliszka wrote:
>
> Hi!
>
> I foun
filters=[['dec', '==', pc.cast(8024,
pa.decimal128(38,10))]])
Does someone know what happened?
It looks kind of strange that it works for np.int16 and decimal but not int64.
And 29 seems confusing, as 2**64 < 10**20 and 38 - 10 = 28 > 20.
Thanks for any help,
Jacek Pliszka
Hi!
I got surprising results when comparing numpy and pyarrow performance.
val = np.uint8(115)
numpy has similar speed whether using 115 or np.uint8(115):
%timeit np.count_nonzero(data_np == val)
591 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit
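A hedged sketch of the kind of pyarrow comparison being timed (array contents
and sizes are made up; pc.equal plus pc.sum is one way to express the same count):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

data_np = np.random.randint(0, 256, size=10_000_000, dtype=np.uint8)
data_pa = pa.array(data_np)
val = np.uint8(115)

n_np = np.count_nonzero(data_np == val)        # NumPy count of matching elements
n_pa = pc.sum(pc.equal(data_pa, 115)).as_py()  # pyarrow equivalent via compute kernels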
Hi!
I am looking for an efficient way of working with pieces of a Table.
Let's say I have a table with 100M rows and a key column having 100
different values.
I would like to be able to quickly get a subtable with just the rows for
a given key value.
Currently I run filter 100 times to generate
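A minimal sketch of that filter-per-key approach (table and column names are
hypothetical):

import pyarrow.compute as pc

# one full filter pass over the table per distinct key value
keys = pc.unique(table.column('key'))
subtables = {k.as_py(): table.filter(pc.equal(table.column('key'), k)) for k in keys}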
to re-write the extracts from Postgres. Is there an
> easy way to partition the results of a SQL query or would I need to write
> something?
>
> Many thanks
>
> Adrian
>
> On Fri, 7 Jul 2023 at 12:53, Jacek Pliszka wrote:
>>
>> Hi!
>>
>> If you ha
Hi!
If you have any influence over how data is dumped from postgres - my
suggestion is to have it already partitioned then.
This would make parallelization much easier.
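If the extract does arrive as one big table, it can also be repartitioned with
pyarrow itself; a hedged sketch (the output path and partition column are made up):

import pyarrow as pa
import pyarrow.dataset as ds

# writes hive-style directories such as extracts/customer_id=123/part-0.parquet
ds.write_dataset(table, 'extracts/', format='parquet',
                 partitioning=ds.partitioning(pa.schema([('customer_id', pa.int64())]),
                                              flavor='hive'))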
BR
Jacek
On Fri, 7 Jul 2023 at 12:21, Adrian Mowat wrote:
>
> Hi,
>
> TL;DR: newbie question. I have a pyarrow program
; good file format for this problem.
>
> It naturally supports slicing like fp['field1'][1000:5000], provides chunking
> and compression, new arrays can be appended... Maybe Arrow is just not the
> right tool for this specific problem.
>
> Kind regards,
>
> Ramon.
>
>
Hi!
I am not sure if this would solve your problem:
pa.concat_tables([pa.Table.from_pydict({'v': v}).append_column('f', [len(v)*[f]])
                  for f, v in x.items()])
pyarrow.Table
v: double
f: string
v: [[0.2,0.2,0.2,0.1,0.2,0,0.8,0.7],[0.3,0.5,0.1],[0.9,nan,nan,0.1,0.5]]
f:
Hi!
I think the API section is more user-friendly:
https://arrow.apache.org/docs/python/api/compute.html#api-compute
https://arrow.apache.org/docs/python/generated/pyarrow.compute.binary_join_element_wise.html#pyarrow.compute.binary_join_element_wise
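A small usage sketch (the example arrays are made up; the last argument acts as
the separator):

import pyarrow as pa
import pyarrow.compute as pc

first = pa.array(['John', 'Jane'])
last = pa.array(['Doe', 'Smith'])
# element-wise join of the string arrays, using the last argument as separator
pc.binary_join_element_wise(first, last, ' ')
# -> ["John Doe", "Jane Smith"]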
BR
J
On Mon, 26 Sep 2022 at 23:48, Ian Cook
Re 2. In the Python Azure SDK there is logic for partial blob reads:
https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-query-blob
However, I was unable to use it as it does not support parquet files
with
Hi!
Why don't you use the arrow Table join directly?
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join
Though you need to be careful with join order, as speed may differ
depending on the order of the joined tables.
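A minimal sketch with hypothetical tables sharing a 'key' column:

import pyarrow as pa

left = pa.table({'key': [1, 2, 3], 'a': ['x', 'y', 'z']})
right = pa.table({'key': [1, 2], 'b': [10.0, 20.0]})
# left outer join on 'key'; as noted above, the order of the tables can affect speed
joined = left.join(right, keys='key', join_type='left outer')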
BR,
Jacek
On Thu, 15 Sep 2022 at 06:15, Weston
Should be - if you need a cast...
t.column(i).cast(...) uses the Arrow cast.
BR,
Jacek
On Mon, 1 Mar 2021 at 17:04, Jacek Pliszka wrote:
>
> Use np.column_stack and list comprehension:
>
> t = pq.read_table('a.pq')
> matrix = np.column_stack([t.column(i) for i in range(t.num_columns)])
> matrix = np.empty((data.num_rows, data.num_columns), dtype=np.bool_)
> for i,col in enumerate(data.columns):
> matrix[:,i] = col
>
>
>
>
> On Monday, 1 March 2021 at 11:31 +0100, Jacek Pliszka wrote:
> > Others will probably give you better hints but
> >
> > You do not need to conver
> Anyway
>
> Have a good day
>
> On Friday, 12 February 2021 at 15:26 +0100, Jacek Pliszka wrote:
> > Sure - I believe you can do it even in pandas - you have columns
> > parameter: pd.read_parquet('f.pq', columns=['A', 'B'])
> >
> > arrow is more useful if you need to do some conversion or filtering.
Sure - I believe you can do it even in pandas - you have columns
parameter: pd.read_parquet('f.pq', columns=['A', 'B'])
arrow is more useful if you need to do some conversion or filtering.
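For example, a hedged sketch of the pyarrow side (the file name and columns are
from the pandas example above; the filter is made up):

import pyarrow.parquet as pq

# read only columns A and B and keep just the rows where A > 0
t = pq.read_table('f.pq', columns=['A', 'B'], filters=[('A', '>', 0)])
df = t.to_pandas()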
BR,
Jacek
On Fri, 12 Feb 2021 at 15:21, jonathan mercier wrote:
>
> Dear,
> I have a parquet files with
I believe it would be good if you defined your use case.
I do handle larger-than-memory datasets with pyarrow with the use of
dataset.scan, but my use case is very specific, as I am repartitioning
and cleaning fairly large datasets.
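A minimal sketch of that kind of streaming scan (the path and the per-batch
processing are hypothetical; in recent pyarrow versions the scan is exposed
through to_batches / Scanner):

import pyarrow.dataset as ds

dataset = ds.dataset('big_data/', format='parquet')
# iterate over record batches instead of materializing the whole table in memory
for batch in dataset.to_batches():
    ...  # clean or repartition each batch here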
BR,
Jacek
On Thu, 22 Oct 2020 at 20:39, Jacob Zelko wrote:
>
>