albertvillanova opened a new issue, #25041:
URL: https://github.com/apache/beam/issues/25041
### What happened?
After the 2.44.0 release, we found an issue when writing to Parquet
with multiple shards: some output files contain 0 rows.
Steps:
```python
import apache_beam as beam
import pyarrow
import pyarrow.parquet

# Write two records to Parquet, explicitly requesting two output shards.
with beam.Pipeline() as p:
    records = p | 'Read' >> beam.Create(
        [{'name': 'foo', 'age': 10}, {'name': 'bar', 'age': 20}]
    )
    _ = records | 'Write' >> beam.io.WriteToParquet(
        "filename",
        pyarrow.schema(
            [('name', pyarrow.binary()), ('age', pyarrow.int64())]
        ),
        num_shards=2,
    )

# The pipeline has finished once the `with` block exits; inspect each shard.
for filename in ["filename-00000-of-00002", "filename-00001-of-00002"]:
    parquet_file = pyarrow.parquet.ParquetFile(filename)
    print(filename)
    print(parquet_file.metadata)
    print()
```
One of the files has 0 rows:
```
filename-00000-of-00002
<pyarrow._parquet.FileMetaData object at 0x7f42d2362810>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 2.6
serialized_size: 514
filename-00001-of-00002
<pyarrow._parquet.FileMetaData object at 0x7f42d2063680>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 2
num_rows: 0
num_row_groups: 0
format_version: 2.6
serialized_size: 340
```
Previously (in version 2.43.0), no file had 0 rows:
```
filename-00000-of-00002
<pyarrow._parquet.FileMetaData object at 0x7f673a4dcb30>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 2
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 512
filename-00001-of-00002
<pyarrow._parquet.FileMetaData object at 0x7f6738cf3950>
created_by: parquet-cpp-arrow version 9.0.0
num_columns: 2
num_rows: 1
num_row_groups: 1
format_version: 2.6
serialized_size: 512
```
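To make the check easier to repeat across versions, here is a minimal sketch of a helper that lists the empty shards. The function names (`shard_names`, `parquet_row_count`, `empty_shards`) are hypothetical and not part of Beam; the sketch assumes the output files follow the `-SSSSS-of-NNNNN` naming seen above and that `pyarrow` is installed when the default row counter is used.

```python
from typing import Callable, List


def shard_names(prefix: str, num_shards: int) -> List[str]:
    """Build file names following the -SSSSS-of-NNNNN pattern seen in the report."""
    return [f"{prefix}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def parquet_row_count(path: str) -> int:
    """Read num_rows from the Parquet file metadata (requires pyarrow)."""
    # Imported lazily so the other helpers can be used without pyarrow.
    import pyarrow.parquet as pq
    return pq.ParquetFile(path).metadata.num_rows


def empty_shards(
    prefix: str,
    num_shards: int,
    count_rows: Callable[[str], int] = parquet_row_count,
) -> List[str]:
    """Return the names of the shard files that contain no rows."""
    return [name for name in shard_names(prefix, num_shards) if count_rows(name) == 0]
```

After running the pipeline above, `empty_shards("filename", 2)` should return an empty list if every shard received data. With the 2.44.0 output shown in this report, it would be expected to return `["filename-00001-of-00002"]`.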
### Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
### Issue Components
- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner