Dear all,

I am trying to study 300,000 samples of SARS-CoV-2 with Parquet/PyArrow, so I have a table with 300,000 columns and around 45,000 rows of presence/absence (0/1). The file is ~150 MB.
I read this file like this:

    import numpy
    import pyarrow.parquet as pq

    data = pq.read_table(dataset_path).to_pandas().to_numpy().astype(numpy.bool_)

This statement takes about 1 hour… Is there a trick to speed up loading these data into memory? Is it possible to distribute the loading with a library such as Ray?

Thanks

Best regards

--
Jonathan MERCIER, PhD
Researcher, computational biology
Bioinformatics (LBI)
2, rue Gaston Crémieux
91057 Evry Cedex
Tel: (+33) 1 60 87 83 44
Email: [email protected]
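PS: here is a minimal timing sketch of the same load, split into its three steps, so we can see where the hour actually goes (the file name below is only a placeholder for my real path):

    import time

    import numpy
    import pyarrow.parquet as pq

    dataset_path = "presence_absence.parquet"  # placeholder for the real file path

    t0 = time.time()
    table = pq.read_table(dataset_path)        # raw Parquet read into an Arrow table
    print("read_table:", time.time() - t0, "s")

    t0 = time.time()
    df = table.to_pandas()                     # Arrow -> pandas conversion
    print("to_pandas:", time.time() - t0, "s")

    t0 = time.time()
    data = df.to_numpy().astype(numpy.bool_)   # pandas -> dense boolean numpy array
    print("to_numpy + astype:", time.time() - t0, "s")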
