kato1208 commented on issue #13142:
URL: https://github.com/apache/arrow/issues/13142#issuecomment-1125752287
Thank you for your answer.
I understand that, in our case, it is better to use record batches.
> I'm not really sure what you mean by this. Are you trying to write the data so it can be read back out a piece at a time?
I'm working on converting huge Protobuf data to Parquet using pyarrow.
We don't want to read all of the Protobuf data into memory at once,
so we want to read it incrementally and write it out as Parquet.
For example:
```
import sys
import pyarrow as pa

records = []
for record in protobuf_obj.read():
    records.append(record)
    if sys.getsizeof(records) > threshold:
        batch = pa.RecordBatch.from_pylist(records, schema=schema)
        writer.write_batch(batch)
        records = []

# flush any records left over after the loop
if records:
    writer.write_batch(pa.RecordBatch.from_pylist(records, schema=schema))
```
I don't know how to determine this threshold.
Or is this approach not appropriate?
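To make the question more concrete, one variant I could imagine is flushing every fixed number of rows instead of using `sys.getsizeof`, and looking at the batch's `nbytes` to see how much Arrow memory each batch actually takes. This is just a rough sketch with the same assumed `protobuf_obj`, `schema`, and `writer` as above, and the `ROWS_PER_BATCH` value is only a placeholder; I'm not sure it is the right approach either.

```
import pyarrow as pa

ROWS_PER_BATCH = 10_000  # placeholder value, would need tuning

records = []
for record in protobuf_obj.read():
    records.append(record)
    if len(records) >= ROWS_PER_BATCH:
        batch = pa.RecordBatch.from_pylist(records, schema=schema)
        # nbytes reports the actual Arrow buffer size of this batch,
        # which could help pick a sensible ROWS_PER_BATCH
        print(batch.nbytes)
        writer.write_batch(batch)
        records = []

# flush whatever is left after the loop
if records:
    writer.write_batch(pa.RecordBatch.from_pylist(records, schema=schema))
```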