joosthooz commented on PR #14032:
URL: https://github.com/apache/arrow/pull/14032#issuecomment-1235492183
I'm using this script to reproduce the problem:
```
import os
import tempfile

import pyarrow.dataset as ds


def file_visitor(written_file):
    print(f"path={written_file.path}")
    print(f"metadata={written_file.metadata}")


with tempfile.TemporaryDirectory() as path:
    # Write one CSV with 4M values, then symlink it 999 times so the
    # dataset scans 1000 identical input files.
    with open(f"{path}/part-0.csv", "w") as f:
        for i in range(2**22):  # 4M values
            f.write(f"{i % 123}\n")
    for add_part in range(1, 1000):
        os.symlink(f"{path}/part-0.csv", f"{path}/part-{add_part}.csv")

    d = ds.dataset(f"{path}", format=ds.CsvFileFormat())
    print(d.schema)

    # Write the dataset back out as uncompressed Parquet.
    outfile = f"{path}/pqfile.parquet"
    dataset_write_format = ds.ParquetFileFormat()
    write_options = dataset_write_format.make_write_options(compression=None)
    ds.write_dataset(
        d.scanner(),
        outfile,
        format=dataset_write_format,
        file_options=write_options,
        file_visitor=file_visitor,
    )
    print("output file size: " +
          str(os.path.getsize(f"{outfile}/part-0.parquet")))
```
It's a bit cumbersome and takes a minute or so to run, so I don't think it is
suitable to add as a unit test.