[jira] [Created] (ARROW-7305) High memory usage writing pyarrow.Table to parquet

2019-12-03 Thread Bogdan Klichuk (Jira)
Bogdan Klichuk created ARROW-7305:
-

 Summary: High memory usage writing pyarrow.Table to parquet
 Key: ARROW-7305
 URL: https://issues.apache.org/jira/browse/ARROW-7305
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Affects Versions: 0.15.1
 Environment: Mac OSX
Reporter: Bogdan Klichuk


My dataset is rather specific: it contains large strings (1-100 MB each).

Let's take a single row as an example.

43mb.csv is a 1-row CSV with 10 columns; one column is a 43 MB string.

When I read this CSV with pandas and then dump it to Parquet, my script consumes 
roughly 10x the 43 MB.

As the number of such rows increases, the relative memory overhead diminishes, but 
I want to focus on this specific case.

Here's the footprint after running it under memory_profiler:
{code:java}
Line #    Mem usage    Increment   Line Contents
================================================
     4     48.9 MiB     48.9 MiB   @profile
     5                             def test():
     6    143.7 MiB     94.7 MiB       data = pd.read_csv('43mb.csv')
     7    498.6 MiB    354.9 MiB       data.to_parquet('out.parquet')
 {code}
Is this typical for Parquet in the case of big strings?
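
For what it's worth, here is the same repro split into the pandas-to-Arrow conversion and the Parquet write, to see which of the two steps accounts for the ~350 MiB increment (just a sketch; it assumes `to_parquet` uses the pyarrow engine, which is what I have installed):
{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def test():
    # same 1-row, 10-column file with one ~43 MB string value
    data = pd.read_csv('43mb.csv')
    # split the write into conversion + serialization to see where memory goes
    table = pa.Table.from_pandas(data)
    pq.write_table(table, 'out.parquet')


test()
{code}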





[jira] [Created] (ARROW-7150) [Python] Explain parquet file size growth

2019-11-12 Thread Bogdan Klichuk (Jira)
Bogdan Klichuk created ARROW-7150:
-

 Summary: [Python] Explain parquet file size growth
 Key: ARROW-7150
 URL: https://issues.apache.org/jira/browse/ARROW-7150
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Affects Versions: 0.14.1
 Environment: Mac OS X. Pyarrow==0.15.1
Reporter: Bogdan Klichuk


With a columnar storage format and gzip compression enabled, I can't make sense of 
how the Parquet file size grows in my specific example.

So far I'm not sharing a dataset (I would need to create a mock one to share).
{code:java}
> df = pandas.read_csv('...')
> len(df)
820
> df.to_parquet('820.parquet', compression='gzip')
> # size of 820.parquet is 6.1M
> df_big = pandas.concat([df] * 10).reset_index(drop=True)
> len(df_big)
8200
> df_big.to_parquet('8200.parquet', compression='gzip')
> # size of 8200.parquet is 320M.
{code}
Compression generally works better on larger files. How can a 10x increase with 
repeated data result in a 50x growth in file size? That seems unreasonable to me.

I'm working on a periodic job that concatenates smaller files into bigger ones, 
and I'm now doubting whether I need it.
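
If it helps diagnose this, here's a sketch for inspecting where the bytes go in the bigger file (per-row-group, per-column compressed vs. uncompressed sizes) using pyarrow's Parquet metadata; the file name is taken from the example above:
{code:java}
import pyarrow.parquet as pq

# dump row-group and column-chunk statistics for the grown file
meta = pq.ParquetFile('8200.parquet').metadata
print('row groups:', meta.num_row_groups, 'rows:', meta.num_rows)
for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        print(col.path_in_schema, col.compression,
              col.total_compressed_size, col.total_uncompressed_size)
{code}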





[jira] [Created] (ARROW-6481) Bad performance of read_csv() with column_types

2019-09-07 Thread Bogdan Klichuk (Jira)
Bogdan Klichuk created ARROW-6481:
-

 Summary: Bad performance of read_csv() with column_types
 Key: ARROW-6481
 URL: https://issues.apache.org/jira/browse/ARROW-6481
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
 Environment: ubuntu xenial
Reporter: Bogdan Klichuk
 Attachments: 20k_cols.csv

Case: a dataset with 20k columns. The number of rows can be 0.

`pyarrow.csv.read_csv()` works reasonably well if no convert_options are provided; 
it takes about 700 ms.

Now I call `read_csv()` with a column_types mapping that marks 2000 of these 
columns as string:

`pyarrow.csv.read_csv('20k_cols.csv', convert_options=pyarrow.csv.ConvertOptions(column_types={'K%d' % i: pyarrow.string() for i in range(2000)}))`

(K0, K1, ... are column names in the attached dataset.)

My overall goal is to read everything as string and avoid any type inference.

This takes several minutes and consumes around 4 GB of memory.

That doesn't seem reasonable at all.
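
For reference, a timing sketch of both calls (assuming the attached 20k_cols.csv has a header row with columns named K0, K1, ...):
{code:java}
import time
import pyarrow.csv

# force 2000 of the 20k columns to string, as in the call above
opts = pyarrow.csv.ConvertOptions(
    column_types={'K%d' % i: pyarrow.string() for i in range(2000)})

start = time.time()
pyarrow.csv.read_csv('20k_cols.csv')
print('no convert_options: %.1f s' % (time.time() - start))

start = time.time()
pyarrow.csv.read_csv('20k_cols.csv', convert_options=opts)
print('with column_types:  %.1f s' % (time.time() - start))
{code}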





[jira] [Created] (ARROW-5811) pyarrow.csv.read_csv: Ability to not infer column types.

2019-06-30 Thread Bogdan Klichuk (JIRA)
Bogdan Klichuk created ARROW-5811:
-

 Summary: pyarrow.csv.read_csv: Ability to not infer column types.
 Key: ARROW-5811
 URL: https://issues.apache.org/jira/browse/ARROW-5811
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.13.0
 Environment: Ubuntu Xenial
Reporter: Bogdan Klichuk


I'm trying to read CSVs as-is, with all columns as strings. I don't know the schema 
of these CSVs, and it will vary since they are provided by users.

Right now I'm using pandas.read_csv(dtype=str), which works great, but since the 
final destination of these CSVs is Parquet files, it seems much more efficient to 
use pyarrow.csv.read_csv in the future, as soon as this becomes available :)

I tried things like 
`pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: 'string')))` 
but it doesn't work.

Maybe I just didn't find something that already exists? :)
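
One workaround I'm considering (just a sketch, assuming a comma-delimited file with a header row; input.csv is a placeholder name): read the header with the stdlib csv module first, then pass an explicit column_types mapping so nothing is inferred.
{code:java}
import csv

import pyarrow
import pyarrow.csv

# placeholder file name; assumes the first line is a comma-delimited header
with open('input.csv') as f:
    header = next(csv.reader(f))

# force every column to string so no type inference happens
opts = pyarrow.csv.ConvertOptions(
    column_types={name: pyarrow.string() for name in header})
table = pyarrow.csv.read_csv('input.csv', convert_options=opts)
{code}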





[jira] [Created] (ARROW-5791) pyarrow.csv.read_csv hangs + eats all RAM

2019-06-29 Thread Bogdan Klichuk (JIRA)
Bogdan Klichuk created ARROW-5791:
-

 Summary: pyarrow.csv.read_csv hangs + eats all RAM
 Key: ARROW-5791
 URL: https://issues.apache.org/jira/browse/ARROW-5791
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.13.0
 Environment: Ubuntu Xenial, python 2.7
Reporter: Bogdan Klichuk
 Attachments: csvtest.py, graph.svg, sample_32768_cols.csv, 
sample_32769_cols.csv

I have quite a sparse dataset in CSV format: a wide table with only a few rows but 
many (32k) columns. Total size is ~540 KB.

When I read the dataset using `pyarrow.csv.read_csv` it hangs, gradually eats 
all memory, and gets killed.

More details on the conditions follow. The script to run and all mentioned files 
are attached.

1) `sample_32769_cols.csv` is the dataset that suffers from the problem.

2) `sample_32768_cols.csv` is the dataset that DOES NOT suffer and is read in 
under 400 ms on my machine. It's the same dataset without the ONE last column. That 
last column is no different from the others and has empty values.

I don't know why exactly this one column makes the difference between proper 
execution and a hanging failure that looks like a memory leak.

I have created a flame graph for case (1) to support resolving this issue 
(`graph.svg`).
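
For reference, a minimal reproduction sketch (the attached csvtest.py presumably does something close to this, but I'm not reproducing its exact contents here):
{code:java}
import pyarrow.csv

# sample_32769_cols.csv hangs and exhausts memory;
# sample_32768_cols.csv (one column fewer) reads in under 400 ms
table = pyarrow.csv.read_csv('sample_32769_cols.csv')
print(table.num_columns, table.num_rows)
{code}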

 


