[ https://issues.apache.org/jira/browse/ARROW-15045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458366#comment-17458366 ]
Joris Van den Bossche commented on ARROW-15045:
-----------------------------------------------

Thanks for that clarification! Now it is clearer what exactly you did with the union dataset.

Since this is not trivial to reproduce locally, would you be able to run your code under a debugger ({{gdb}}) to see if you can get a stack trace of where it crashes (see e.g. https://stackoverflow.com/a/49414907/653364)? That might give useful information for understanding the cause of the crash.

bq. I just checked the data folder details, it's 28.3GB for 880219 files, so yeah, sorry for that mistake.

As an aside, this still means your files are on average only around 30 kB each, if I calculated that correctly. Generally, that's considered very small, certainly for Parquet files (given the per-file metadata overhead). I am not fully sure whether we have an option to disable reading Parquet metadata to reduce the memory usage.

> PyArrow SIGSEGV error when using UnionDatasets
> ----------------------------------------------
>
>                 Key: ARROW-15045
>                 URL: https://issues.apache.org/jira/browse/ARROW-15045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.1
>         Environment: Fedora Linux 35 (Workstation Edition), AMD Ryzen 5950X.
>            Reporter: Thomas Cercato
>            Priority: Blocker
>              Labels: dataset
>
> h3. The context:
> I am using PyArrow to read a folder structured as
> {{exchange/symbol/date.parquet}}. The folder contains multiple exchanges,
> multiple symbols and multiple files. At the time of writing, the folder is
> about 30GB/1.85M files.
> If I use a single PyArrow Dataset to read/manage the entire folder, the
> simplest process with just the dataset defined will occupy 2.3GB of RAM. The
> problem is, I am instantiating this dataset in multiple processes, but since
> every process only needs some exchanges (typically just one), I don't need to
> read all folders and files in every single process.
> So I tried to use a UnionDataset composed of single-exchange Datasets. In this
> way, every process loads only the required folders/files as a dataset. In a
> simple test, every process now occupies just 868MB of RAM, -63%.
> h3. The problem:
> When using a single Dataset for the entire folder/files, I have no problem at
> all. I can read filtered data without problems and it's fast as duck.
> But when I read the UnionDataset filtered data, I always get a {{Process
> finished with exit code 139 (interrupted by signal 11: SIGSEGV)}} error. So
> after looking into every possible source of the problem, I noticed that if I
> create a dummy folder with multiple exchanges but just some symbols, in order
> to limit the number of files to read, I don't get that error and it works
> normally. If I then copy in new symbol folders (any), I get that error again.
> I came to think that the problem is not in my code, but is linked instead
> to the number of files that the UnionDataset is able to manage.
> Am I correct or am I doing something wrong? Thank you all, have a nice day
> and good work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
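
The average file size estimated in the comment above can be checked with a few lines of Python. This is only a minimal sketch: it assumes "GB" means 10**9 bytes (the thread does not say), and the variable names are illustrative.

```python
# Sanity check of the average Parquet file size quoted in the comment.
# Assumption: "GB" is decimal (10**9 bytes); the thread does not specify.
TOTAL_BYTES = 28.3 * 10**9   # reported size of the data folder
FILE_COUNT = 880_219         # reported number of files

avg_bytes = TOTAL_BYTES / FILE_COUNT
avg_kb = avg_bytes / 1000

print(f"average file size: {avg_kb:.1f} kB")
```

Under that assumption the average comes out at about 32 kB per file, consistent with the "around 30 kB" estimate.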