gecko655 created BEAM-10596:
-------------------------------
Summary: Sharding with fileio.WriteToFiles need to set
`max_writers_per_bundle=0`?
Key: BEAM-10596
URL: https://issues.apache.org/jira/browse/BEAM-10596
Project: Beam
Issue Type: Bug
Components: sdk-py-core
Affects Versions: 2.24.0
Environment: - Python 3.7.6
- `apache-beam==2.24.0.dev0`
- Reproducing is done in GCP's jupyter notebook environment.
https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development
Reporter: gecko655
h3. Description:
`fileio.WriteToFiles` ignores the option `shards=3` given to its constructor
unless I set `max_writers_per_bundle` to `0`
h3. Example:
Suppose I have the following pipeline (with interactive runner):
{code:python}
import apache_beam as beam
import apache_beam.io.fileio as fileio
import apache_beam.runners.interactive.interactive_beam as ib
user_ids = list(map(lambda x: 'user_id' + str(x), range(0, 10000)))
with beam.Pipeline(InteractiveRunner()) as pipeline:
user_list = pipeline | 'create pcollection' >> beam.Create(user_ids)
write_sharded_csv = user_list | 'write sharded csv files' >>
fileio.WriteToFiles(
path='/tmp/data/',
shards=3,
file_naming=fileio.default_file_naming(prefix='userlist',
suffix='.csv'),
# max_writers_per_bundle=0,
)
ib.show(write_sharded_csv)
{code}
This pipeline is implemented to...
- Creates PCollection of strings: 'user_id1', 'user_id2', ... 'user_id10000'
- Writes the user ids to 3 local files with sharding.
The code does not work as intended. It writes the user ids to only 1 file.
The code DOES work as intended after I added the `max_writers_per_bundle=0`
argument to the `WriteToFiles` constructor.
Is the behavior intentional or bug?
I couldn't understand why `max_writers_per_bundle` is related to the sharding
behavior. I couldn't find any documentation about this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)