joosthooz commented on PR #14032:
URL: https://github.com/apache/arrow/pull/14032#issuecomment-1235492183
I'm using this script to reproduce the problem:
```
import os
import tempfile

import pyarrow.dataset as ds


def file_visitor(written_file):
    print(f"path={written_file.path}")
    print(f"metadata={written_file.metadata}")


with tempfile.TemporaryDirectory() as path:
    # Write one CSV with 4M values, then symlink it 999 times so the
    # dataset scans 1000 identical input files.
    with open(f"{path}/part-0.csv", "w") as f:
        for i in range(2**22):  # 4M values
            f.write(f"{i % 123}\n")
    for add_part in range(1, 1000):
        os.symlink(f"{path}/part-0.csv", f"{path}/part-{add_part}.csv")

    d = ds.dataset(f"{path}", format=ds.CsvFileFormat())
    print(d.schema)

    # Write the dataset back out as uncompressed Parquet.
    outfile = f"{path}/pqfile.parquet"
    dataset_write_format = ds.ParquetFileFormat()
    write_options = dataset_write_format.make_write_options(compression=None)
    ds.write_dataset(
        d.scanner(),
        outfile,
        format=dataset_write_format,
        file_options=write_options,
        file_visitor=file_visitor,
    )
    print("output file size: " +
          str(os.path.getsize(f"{outfile}/part-0.parquet")))
```
It's a bit cumbersome and takes a minute or so to run, so I don't think it is
suitable to add as a unit test.