[jira] [Created] (ARROW-7305) High memory usage writing pyarrow.Table to parquet
Bogdan Klichuk created ARROW-7305: - Summary: High memory usage writing pyarrow.Table to parquet Key: ARROW-7305 URL: https://issues.apache.org/jira/browse/ARROW-7305 Project: Apache Arrow Issue Type: Task Components: Python Affects Versions: 0.15.1 Environment: Mac OSX Reporter: Bogdan Klichuk

My case is specific: the stored datasets contain large strings (1-100 MB each). Take a single row as an example: 43mb.csv is a 1-row CSV with 10 columns, one of which is a 43 MB string. When I read this CSV with pandas and then dump it to parquet, my script consumes 10x the 43 MB. With an increasing number of such rows the relative memory overhead diminishes, but I want to focus on this specific case. Here's the footprint after running under memory_profiler:

{code:java}
Line #    Mem usage    Increment   Line Contents
================================================
     4     48.9 MiB     48.9 MiB   @profile
     5                             def test():
     6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
     7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
{code}

Is this typical for parquet in the case of big strings?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7150) [Python] Explain parquet file size growth
Bogdan Klichuk created ARROW-7150: - Summary: [Python] Explain parquet file size growth Key: ARROW-7150 URL: https://issues.apache.org/jira/browse/ARROW-7150 Project: Apache Arrow Issue Type: Task Components: Python Affects Versions: 0.14.1 Environment: Mac OS X. Pyarrow==0.15.1 Reporter: Bogdan Klichuk

With a columnar storage format in mind, and gzip compression enabled, I can't make sense of how the parquet file size grows in my specific example. So far without sharing a dataset (I would need to create a mock one to share):

{code:java}
> df = pandas.read_csv('...')
> len(df)
820
> df.to_parquet('820.parquet', compression='gzip')
> # size of 820.parquet is 6.1M
> df_big = pandas.concat([df] * 10).reset_index(drop=True)
> len(df_big)
8200
> df_big.to_parquet('8200.parquet', compression='gzip')
> # size of 8200.parquet is 320M
{code}

Compression usually works better on bigger files. How come a 10x increase with repeated data resulted in 50x growth of the file? Insane imo. I'm working on a periodic job that concats smaller files into bigger ones, and now I doubt whether I need it at all.
[jira] [Created] (ARROW-6481) Bad performance of read_csv() with column_types
Bogdan Klichuk created ARROW-6481: - Summary: Bad performance of read_csv() with column_types Key: ARROW-6481 URL: https://issues.apache.org/jira/browse/ARROW-6481 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.14.1 Environment: ubuntu xenial Reporter: Bogdan Klichuk Attachments: 20k_cols.csv

Case: a dataset with 20k columns. The number of rows can be 0. `pyarrow.csv.read_csv()` works rather fine if no convert_options are provided; it took 700ms. Now I call `read_csv()` with a column types mapping that marks 2000 out of these columns as string:

{code:java}
pyarrow.csv.read_csv(
    '20k_cols.csv',
    convert_options=pyarrow.csv.ConvertOptions(
        column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))
{code}

(K0, K1, ... are the column names in the attached dataset.) My task, globally, is to read everything as string and avoid any inference. This call takes several minutes and consumes around 4GB of memory, which doesn't look sane at all.
[jira] [Created] (ARROW-5811) pyarrow.csv.read_csv: Ability to not infer column types.
Bogdan Klichuk created ARROW-5811: - Summary: pyarrow.csv.read_csv: Ability to not infer column types. Key: ARROW-5811 URL: https://issues.apache.org/jira/browse/ARROW-5811 Project: Apache Arrow Issue Type: Improvement Components: Python Affects Versions: 0.13.0 Environment: Ubuntu Xenial Reporter: Bogdan Klichuk

I'm trying to read a CSV as is, all columns as strings. I don't know the schema of these CSVs and they will vary, as they are provided by users. Right now I'm using pandas.read_csv(dtype=str), which works great, but since the final destination of these CSVs is parquet files, it seems much more efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes available :) I tried things like `pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: 'string')))` but it doesn't work. Maybe I just didn't find something that already exists? :)
[jira] [Created] (ARROW-5791) pyarrow.csv.read_csv hangs + eats all RAM
Bogdan Klichuk created ARROW-5791: - Summary: pyarrow.csv.read_csv hangs + eats all RAM Key: ARROW-5791 URL: https://issues.apache.org/jira/browse/ARROW-5791 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.13.0 Environment: Ubuntu Xenial, python 2.7 Reporter: Bogdan Klichuk Attachments: csvtest.py, graph.svg, sample_32768_cols.csv, sample_32769_cols.csv

I have quite a sparse dataset in CSV format: a wide table that has several rows but many (32k) columns, total size ~540K. When I read the dataset using `pyarrow.csv.read_csv` it hangs, gradually eats all memory and gets killed. More details on the conditions follow; the script to run and all mentioned files are attached.

1) `sample_32769_cols.csv` is the dataset that suffers from the problem.
2) `sample_32768_cols.csv` is the dataset that does NOT suffer and is read in under 400ms on my machine. It's the same dataset without ONE last column. That last column is no different from the others and has empty values.

Why exactly this one column makes the difference between proper execution and a hanging failure that looks like a memory leak, I don't know. I have created a flame graph for case (1) to support resolving this issue (`graph.svg`).
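A sketch that regenerates files shaped like the attachments: a few rows, many columns, mostly empty values. The column names and exact contents are assumptions; only the shape (wide, sparse, 32768 vs. 32769 columns) comes from the report:

```python
import csv
import os
import tempfile


def write_sparse_csv(path, n_cols, n_rows=3):
    # Wide, sparse table: header c0..c{n_cols-1}, then rows of empty values.
    # Column names are assumptions -- the attachments may use their own.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow("c%d" % i for i in range(n_cols))
        for _ in range(n_rows):
            writer.writerow([""] * n_cols)
```

Writing one file with n_cols=32768 and another with n_cols=32769, then feeding both to pyarrow.csv.read_csv, should show the same cliff as the attached samples (the sketch stops short of that call to stay cheap).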