[GitHub] [arrow] marklit opened a new issue, #15220: Speed up Parquet Writing?

GitBox Fri, 06 Jan 2023 04:31:53 -0800


marklit opened a new issue, #15220:
URL: https://github.com/apache/arrow/issues/15220


   ### Describe the enhancement requested
   
   The following was run on Ubuntu 20 on an `e2-highcpu-32` GCP VM with 32 GB of RAM and 32 vCPUs.
   
   I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints, converted it to JSONL, and then converted the JSONL into Parquet with pyarrow. I also attempted to do the same with fastparquet.
   
   ```bash
   $ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
       | jq -c '.properties * {geom: .geometry|tostring}' \
       > California.jsonl
   $ head -n1 California.jsonl | jq .
   ```
   
   ```json
   {
     "release": 1,
     "capture_dates_range": "",
     "geom": 
"{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
   }
   ```
   
   PyArrow is able to produce a 794 MB Parquet file in 49.86 seconds.
   
   ```bash
   $ /usr/bin/time -v \
       python3 -c "import pandas as pd; pd.read_json('California.jsonl', lines=True).to_parquet('pandas.pyarrow.snappy.pq', row_group_size=37738, engine='pyarrow')"
   ```
   
   With ClickHouse I'm able to complete the same task in 18.35 seconds.
   
   ```
   $ /usr/bin/time -v \
       clickhouse local \
             --input-format JSONEachRow \
             -q "SELECT *
                 FROM table
                 FORMAT Parquet" \
       < California.jsonl \
       > ch.snappy.pq
   ```
   
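   The resulting PyArrow Parquet file has the same number of row groups as the ClickHouse file, and both use snappy compression. The metadata below is for the PyArrow file first (`parquet-cpp-arrow version 10.0.1`), then the ClickHouse file.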
   
   ```
   <pyarrow._parquet.FileMetaData object at 0x7f8edab544f0>
     created_by: parquet-cpp-arrow version 10.0.1
     num_columns: 3
     num_rows: 11542912
     num_row_groups: 306
     format_version: 2.6
     serialized_size: 228114
   ```
   
   ```
   <pyarrow._parquet.FileMetaData object at 0x7f0926d54860>
     created_by: parquet-cpp version 1.5.1-SNAPSHOT
     num_columns: 3
     num_rows: 11542912
     num_row_groups: 306
     format_version: 1.0
     serialized_size: 228389
   ```
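
As a sanity check, the 306 row groups reported above are consistent with the `row_group_size=37738` passed to `to_parquet()`. The short calculation below assumes the writer fills every row group to that size except the last one.

```python
import math

num_rows = 11_542_912      # num_rows reported in both metadata dumps
row_group_size = 37_738    # value passed to to_parquet() in the PyArrow run

# Every row group is assumed full except possibly the last one.
num_row_groups = math.ceil(num_rows / row_group_size)
last_group_rows = num_rows - (num_row_groups - 1) * row_group_size

print(num_row_groups)   # 306
print(last_group_rows)  # 32822
```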
   
   The ClickHouse-produced Parquet file is 19,979 bytes larger than the 
PyArrow-produced file.
   
   These are the versions of software involved:
   
   * pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
   * pyarrow-10.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
   * ClickHouse 22.13.1.1361 (official build)
   
   Below is a flame graph from PyArrow's execution.
   
   ![parquet pyarrow snappy](https://user-images.githubusercontent.com/359316/211012988-6dc96d97-fea8-445a-b6f2-9ba648301ceb.svg)
   
   ### Component(s)
   
   Parquet


