Hi, I had a few questions regarding pyarrow.parquet. I want to write a Parquet dataset that is partitioned according to one column. I have a large CSV file, and I'm reading it in chunks with the following code:
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()

But this code writes a single Parquet file, and I don't see any method on ParquetWriter to write to a dataset; it just has the write_table method. Is there a way to do this? Also, how do I write the metadata file and the common metadata file for the example above, as well as the metadata files in the case of a partitioned dataset?

Thanks in advance.

--
Regards,
Palak Harwani
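
P.S. To make the first question more concrete: for a table that fits in memory, I believe pq.write_to_dataset() with partition_cols produces the kind of partitioned layout I'm after (the column name 'my_partition_col' and the output path below are just placeholders on my side). What I can't see is how to do the equivalent chunk by chunk with ParquetWriter:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Read the whole file at once -- this is exactly what I'd like to avoid
df = pd.read_csv('/path/to/my.tsv', sep='\t')
table = pa.Table.from_pandas(df)

# Writes one directory per value of the partition column, e.g.
# /path/to/my_dataset/my_partition_col=<value>/<uuid>.parquet
pq.write_to_dataset(table,
                    root_path='/path/to/my_dataset',
                    partition_cols=['my_partition_col'])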
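
And for the metadata question, this is what I pieced together from the docs, but I'm not sure it's right (I assume the metadata_collector argument depends on the pyarrow version):

import pyarrow.parquet as pq

# _common_metadata: schema only, no row group information
pq.write_metadata(table.schema, '/path/to/my_dataset/_common_metadata')

# _metadata: schema plus row group metadata of all written files
metadata_collector = []
pq.write_to_dataset(table,
                    root_path='/path/to/my_dataset',
                    partition_cols=['my_partition_col'],
                    metadata_collector=metadata_collector)
pq.write_metadata(table.schema,
                  '/path/to/my_dataset/_metadata',
                  metadata_collector=metadata_collector)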