Hi Jonathan,
This won't directly help your situation, but Parquet generally scales better
with fewer columns and more rows, so at least transposing the data would
help with load time (I also agree that modelling with even fewer columns, as
suggested above, would help even more).
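A rough sketch of that transposition (assuming the whole matrix fits in
memory; the file names and generated column names here are made up):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# read the wide table and stack its columns into a dense numpy matrix
t = pq.read_table('wide.pq')
m = np.column_stack([t.column(i) for i in range(t.num_columns)])

# transpose and write back so the long dimension becomes rows
mt = m.T
pq.write_table(pa.table({f'c{i}': mt[:, i] for i in range(mt.shape[1])}),
               'tall.pq')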
Should be: if you need a cast, t.column(i).cast(...) uses the Arrow cast.
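For example, a sketch of that Arrow-side cast (pa.bool_() is only an
assumed target type):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

t = pq.read_table('a.pq')
# cast each column with Arrow before numpy sees it
matrix = np.column_stack(
    [t.column(i).cast(pa.bool_()) for i in range(t.num_columns)])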
BR,
Jacek
Mon, 1 Mar 2021 at 17:04 Jacek Pliszka wrote:
Use np.column_stack and a list comprehension:

import numpy as np
import pyarrow.parquet as pq

t = pq.read_table('a.pq')
matrix = np.column_stack([t.column(i) for i in range(t.num_columns)])

If you need a cast, use the pyarrow or numpy one, depending on your case.
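For instance, the numpy-side cast would be something like:

matrix = np.column_stack(
    [t.column(i) for i in range(t.num_columns)]).astype(np.bool_)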
BR,
Jacek
@jonathan. If your file is already a Parquet file, you can read it with
pandas using pd.read_parquet. If it isn't, I have found that it is better
to use any other method to read the file and then create the dataframe. Once
you have it, save it as Parquet with DataFrame.to_parquet.
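A short sketch of that round trip (the file name is made up):

import pandas as pd

df = pd.read_parquet('a.pq')  # uses the pyarrow engine by default
matrix = df.to_numpy()        # dense numpy matrix of the frame
df.to_parquet('a.pq')         # and writing a frame back to Parquet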
I understand that this does not answer the question, but it may be worth
pointing out regardless: if you control the writing, it may be more
suitable to encode the columns and use a link list for the problem: encode
each column by a number x and store the data as two columns, for example
rows of (id, x) pairs.
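A minimal sketch of that encoding, assuming only the "present" cells need
storing (names and types here are made up):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# dense 0/1 matrix: one row per sample id, columns encoded by index x
dense = np.array([[1, 0, 1],
                  [0, 1, 0]], dtype=np.bool_)

# keep only the cells that are 1, as (id, x) pairs
ids, xs = np.nonzero(dense)
pq.write_table(pa.table({'id': ids, 'x': xs}), 'long.pq')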
Thanks for the hint.
I did not see a to_numpy method on the Table object, so I think I have to do
it manually in Python, something like:

import pyarrow.parquet as pq
import numpy as np

data = pq.read_table(dataset_path)
matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)
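One possible way to finish that manual fill (a sketch, assuming every
column converts cleanly to 0/1):

for i in range(data.num_columns):
    # ChunkedArray -> numpy, one column at a time
    matrix[:, i] = data.column(i).to_numpy()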
Dear,
I am trying to study 300,000 samples of SARS-CoV-2 with Parquet/pyarrow,
so I have a table with 300,000 columns and around 45,000 rows of
presence/absence (0/1). It is a file of ~150 MB.
I read this file like this:

import pyarrow.parquet as pq
data = pq.read_table(dataset_path)