Re: why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread Micah Kornfield
Hi Jonathan, This won't directly help your situation, but Parquet generally scales better with fewer columns and more rows, so at least transposing the data would help with load time (I also agree that modelling with even fewer columns, as suggested above, would help even more). What version of
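A minimal sketch of the transposition idea, assuming you control how the file is written and that the 0/1 matrix fits in memory; the dimensions, column names, and random data below are illustrative, not from the thread:

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Illustrative small dimensions; the real data is ~45,000 rows x 300,000 sample columns.
    n_rows, n_samples = 1_000, 5_000
    matrix = np.random.randint(0, 2, size=(n_rows, n_samples), dtype=np.uint8)

    # Transpose so that samples become rows: Parquet scales much better with
    # many rows and few columns than with many columns and few rows.
    transposed = matrix.T  # shape (n_samples, n_rows)

    table = pa.table({f"r{i}": transposed[:, i] for i in range(n_rows)})
    pq.write_table(table, "transposed.parquet")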

Re: why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread Jacek Pliszka
Should be: if you need a cast, t.column(i).cast(...) uses the Arrow cast. BR, Jacek Mon., 1 Mar 2021 at 17:04 Jacek Pliszka wrote: > > Use np.column_stack and a list comprehension: > > t = pq.read_table('a.pq') > matrix = np.column_stack([t.column(i) for i in range(t.num_columns)]) > > If you
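Putting the two messages together, a runnable sketch of the column_stack approach with an explicit Arrow cast; the target type (uint8) and the file name are placeholders, not from the thread:

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    t = pq.read_table('a.pq')

    # Cast each column with the Arrow cast, then stack everything into a
    # dense NumPy matrix of shape (num_rows, num_columns).
    matrix = np.column_stack(
        [t.column(i).cast(pa.uint8()) for i in range(t.num_columns)]
    )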

Re: why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread Jacek Pliszka
Use np.column_stack and a list comprehension: t = pq.read_table('a.pq') matrix = np.column_stack([t.column(i) for i in range(t.num_columns)]) If you need a cast, use the pyarrow or numpy one, depending on your case. BR, Jacek Mon., 1 Mar 2021 at 14:07 jonathan mercier wrote: > > Thanks for the

Re: why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread Fernando Herrera
@jonathan. If your file is already a Parquet file, you can read it with pandas using pd.read_parquet. If it isn't, I have found it is better to use another method to read the file and then create the dataframe. Once you have it, save it as Parquet with pandas.to_parquet. @jorge. The
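A sketch of the pandas round trip being described; the file names are placeholders and pd.read_csv stands in for whatever "other method" fits the source format:

    import pandas as pd

    # Already Parquet: read it directly (pandas delegates to pyarrow or fastparquet).
    df = pd.read_parquet('data.parquet')

    # Not Parquet yet: read it with a reader that fits the source format,
    # then persist it as Parquet so later loads are fast.
    df = pd.read_csv('data.csv')
    df.to_parquet('data.parquet')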

Re: why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread Jorge Cardoso Leitão
I understand that this does not answer the question, but it may be worth pointing out regardless: if you control the writing, it may be more suitable to encode the columns and use a linked list for the problem: encode each column by a number x and store the data as two columns. For example: id, x0,
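The example in this message is cut off; one reading of the suggestion is a sparse, two-column "long" encoding in which only the present (id, sample) pairs are stored. A minimal sketch under that assumption, with made-up column names and values:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One row per (id, x) pair where the value is 1, instead of one 0/1
    # column per sample: two narrow columns instead of 300,000 wide ones.
    ids = [0, 0, 1, 2]
    samples = [17, 42, 17, 99]

    table = pa.table({
        "id": pa.array(ids, type=pa.int32()),
        "x": pa.array(samples, type=pa.int32()),
    })
    pq.write_table(table, "long_format.parquet")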

Re: why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread jonathan mercier
Thanks for the hint. I did not see a to_numpy method on the Table object, so I think I have to do it manually in Python, something like: import pyarrow.parquet as pq import numpy as np data = pq.read_table(dataset_path) matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)
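The snippet above is cut off; one way the manual fill could continue (not necessarily what was actually written) is to copy each Arrow column into the preallocated array:

    import numpy as np
    import pyarrow.parquet as pq

    dataset_path = "data.parquet"  # placeholder path
    data = pq.read_table(dataset_path)
    matrix = np.zeros((data.num_rows, data.num_columns), dtype=np.bool_)

    # Copy each column into the matrix; NumPy converts the 0/1 integers to bool.
    for i in range(data.num_columns):
        matrix[:, i] = data.column(i).to_numpy()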

why does it take so long to read a parquet file with 300,000 columns

2021-03-01 Thread jonathan mercier
Dear all, I am trying to study 300,000 samples of SARS-CoV-2 with parquet/pyarrow, so I have a table with 300,000 columns and around 45,000 rows of presence/absence (0/1). It is a file of ~150 MB. I read this file like this: import pyarrow.parquet as pq data =