marklit opened a new issue, #15220: URL: https://github.com/apache/arrow/issues/15220
### Describe the enhancement requested

The following was run on Ubuntu 20 on an `e2-highcpu-32` GCP VM with 32 GB of RAM and 32 vCPUs.

I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with pyarrow. I also attempted to do the same with fastparquet.

```bash
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
```

```json
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
```

PyArrow is able to produce a 794 MB Parquet file in 49.86 seconds.

```bash
/usr/bin/time -v \
    python3 -c "import pandas as pd; pd.read_json('California.jsonl', lines=True).to_parquet('pandas.pyarrow.snappy.pq', row_group_size=37738, engine='pyarrow')"
```

With ClickHouse I'm able to complete the same task in 18.35 seconds.

```
$ /usr/bin/time -v \
    clickhouse local \
    --input-format JSONEachRow \
    -q "SELECT * FROM table FORMAT Parquet" \
    < California.jsonl \
    > ch.snappy.pq
```

The resulting PyArrow Parquet file matches the ClickHouse one in number of row groups and in its use of snappy compression.

```
<pyarrow._parquet.FileMetaData object at 0x7f8edab544f0>
  created_by: parquet-cpp-arrow version 10.0.1
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 2.6
  serialized_size: 228114
```

```
<pyarrow._parquet.FileMetaData object at 0x7f0926d54860>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 3
  num_rows: 11542912
  num_row_groups: 306
  format_version: 1.0
  serialized_size: 228389
```

The ClickHouse-produced Parquet file is 19,979 bytes larger than the PyArrow-produced file.
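
For anyone wanting to cross-check the `serialized_size` values above without pyarrow, here is a minimal sketch that reads the footer length straight from the file tail. It relies on the Parquet file layout, where an unencrypted file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The helper name `parquet_footer_length` is hypothetical and not part of any library.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def parquet_footer_length(path):
    """Return the byte length of the Thrift-encoded footer metadata.

    Hypothetical helper for illustration only. It does not handle files
    with encrypted footers, which end with a different magic value.
    """
    if os.path.getsize(path) < 8:
        raise ValueError(f"{path} is too small to be a Parquet file")
    with open(path, "rb") as f:
        # The last 8 bytes are: <4-byte little-endian footer length><magic>.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError(f"{path} does not end with the Parquet magic bytes")
    (length,) = struct.unpack("<I", tail[:4])
    return length
```

Calling `parquet_footer_length('pandas.pyarrow.snappy.pq')` and `parquet_footer_length('ch.snappy.pq')` should give figures comparable to the `serialized_size` fields, assuming pyarrow reports the size of the Thrift-encoded footer there, as its documentation describes.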

These are the versions of software involved:

* pandas-1.5.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
* pyarrow-10.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
* ClickHouse 22.13.1.1361 (official build)

Below is a flame graph from PyArrow's execution.

![parquet pyarrow snappy](https://user-images.githubusercontent.com/359316/211012988-6dc96d97-fea8-445a-b6f2-9ba648301ceb.svg)

### Component(s)

Parquet