[ https://issues.apache.org/jira/browse/ARROW-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458340#comment-17458340 ]
Thomas Cercato commented on ARROW-15045:
----------------------------------------

This is an example tree of my data folder:

data_dir
---- exchange_0_dir
-------- symbol_0_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
-------- symbol_1_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
---- exchange_1_dir
-------- symbol_0_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet
-------- symbol_1_dir
------------ 2017-01-01.parquet
------------ 2017-01-02.parquet
------------ date-n.parquet

If I create a dataset as {{dataset(source='path/to/data_dir/', format='parquet', partitioning=partitioning(field_names=['exchange', 'asset']))}}, it reads all the exchange directories and their contents, and since I have tons of files, that single instance occupies 2.3GB of memory per process. So I tried to create a UnionDataset as {{dataset(source=[dataset(source=exchange, format='parquet', partitioning=partitioning(field_names=['asset'])) for exchange in [exchange_0_dir, exchange_6_dir, exchange_9_dir]])}}, and it returns that SIGSEGV error.

I just checked the data folder details: it's *28.3GB for 880219 files*, so yeah, sorry for that mistake.

> PyArrow SIGSEGV error when using UnionDatasets
> ----------------------------------------------
>
>                 Key: ARROW-15045
>                 URL: https://issues.apache.org/jira/browse/ARROW-15045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.1
>         Environment: Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.
>            Reporter: Thomas Cercato
>            Priority: Blocker
>              Labels: dataset
>
> h3. The context:
> I am using PyArrow to read a folder structured as {{exchange/symbol/date.parquet}}. The folder contains multiple exchanges, multiple symbols and multiple files. At the time of writing, the folder is about 30GB/1.85M files.
> If I use a single PyArrow Dataset to read/manage the entire folder, the simplest process with just the dataset defined occupies 2.3GB of RAM. The problem is, I am instantiating this dataset in multiple processes, but since every process only needs some exchanges (typically just one), I don't need to read all folders and files in every single process.
> So I tried to use a UnionDataset composed of single-exchange Datasets. In this way, every process only loads the folders/files it requires as a dataset. In a simple test, doing so means every process now occupies just 868MB of RAM, -63%.
> h3. The problem:
> When using a single Dataset for the entire folder/files, I have no problem at all. I can read filtered data without issues, and it's very fast.
> But when I read the UnionDataset's filtered data, I always get a {{Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)}} error. So after checking every possible source of the problem, I noticed that if I create a dummy folder with multiple exchanges but only some symbols, in order to limit the number of files to read, I don't get that error and everything works normally. If I then copy in new symbol folders (any of them), I get that error again.
> I came to think that the problem is not in my code, but is instead linked to the number of files that the UnionDataset is able to manage.
> Am I correct, or am I doing something wrong? Thank you all, have a nice day and good work.

-- This message was sent by Atlassian Jira (v8.20.1#820001)