Dear all,

I am trying to study 300,000 samples of SARS-CoV-2 with Parquet/PyArrow, so I have a table with 300,000 columns and around 45,000 rows of presence/absence (0/1). The file is ~150 MB.
I read this file like this:

    import numpy
    import pyarrow.parquet as pq

    data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)

This statement takes about 1 hour… Is there a trick to speed up loading these data into memory? Is it possible to distribute the loading with a library such as Ray?

Thanks

Best regards

--
Jonathan MERCIER, PhD
Researcher, computational biology
Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex
Tel: (+33) 1 60 87 83 44
Email: [email protected]
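PS: here is a minimal timing sketch of the same load, split into its three steps, so we can see where the hour actually goes (the file name below is only a placeholder for my real path):

    import time

    import numpy
    import pyarrow.parquet as pq

    dataset_path = "presence_absence.parquet"  # placeholder for the real file path

    t0 = time.time()
    table = pq.read_table(dataset_path)        # raw Parquet read into an Arrow table
    print("read_table:", time.time() - t0, "s")

    t0 = time.time()
    df = table.to_pandas()                     # Arrow -> pandas conversion
    print("to_pandas:", time.time() - t0, "s")

    t0 = time.time()
    data = df.to_numpy().astype(numpy.bool_)   # pandas -> dense boolean numpy array
    print("to_numpy + astype:", time.time() - t0, "s")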
