Hey folks,

So, I've been trying to run a custom S3 filesystem of mine on the Dataflow
runner. It works fine on the direct runner and even produces the correct
results. :)
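
For context, the pipeline is basically the wordcount example with the
gs:// paths replaced by s3:// paths. The read boils down to something like
this (the package/module names for my filesystem are made up here for
illustration; the path is the real one from the error below):

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # My S3FileSystem subclass lives in a separate package that I ship to
    # the pipeline; importing its module is what makes it visible to Beam.
    import my_s3_package.s3filesystem  # noqa: F401

    p = beam.Pipeline(options=PipelineOptions())
    lines = p | 'read' >> beam.io.ReadFromText('s3://pm-dataflow/test')
    # ... the rest is the stock wordcount transforms and a write ...
    p.run()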

However, when I run the same code on Dataflow, I see the following errors:


(b84963e5767747e2): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 581, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 209, in execute
    self._split_task)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 217, in _perform_source_split_considering_api_limits
    desired_bundle_size)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 254, in _perform_source_split
    for split in source.split(desired_bundle_size):
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 172, in split
    return self._get_concat_source().split(
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py", line 109, in _f
    return fnc(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 119, in _get_concat_source
    match_result = FileSystems.match([pattern])[0]
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 130, in match
    filesystem = FileSystems.get_filesystem(patterns[0])
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 61, in get_filesystem
    raise ValueError('Unable to get the Filesystem for path %s' % path)
ValueError: Unable to get the Filesystem for path s3://pm-dataflow/test


To my understanding, this means that the module containing my
S3FileSystem never gets imported on the worker (it currently resides in a
separate package that I ship to the pipeline). And I'm guessing the way or
order in which modules get imported differs between runners, which would be
the only explanation for why exactly the same pipeline works on the direct
runner and produces the correct results.
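
If I'm reading apache_beam/io/filesystems.py right, get_filesystem picks
the filesystem by matching the path's scheme against whatever FileSystem
subclasses are visible at call time; paraphrased, it does roughly this
(a sketch, not the exact Beam source):

    # paraphrased sketch of FileSystems.get_filesystem, not the exact code
    path_scheme = FileSystems.get_scheme(path)  # 's3' for 's3://pm-dataflow/test'
    systems = [fs for fs in FileSystem.get_all_subclasses()
               if fs.scheme() == path_scheme]
    if len(systems) == 0:
        raise ValueError('Unable to get the Filesystem for path %s' % path)

So if the module defining S3FileSystem is never imported in the worker
process, the subclass simply isn't there when the lookup runs, and it fails
with exactly this error.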

Any ideas on the best way to troubleshoot this? Can I somehow make Beam
upload a locally modified version of the apache_beam package to Dataflow?
I'd be happy to provide more details if needed. Or should I instead be
reaching out to the Dataflow team?
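
(For the second question: from skimming SetupOptions, it looks like
--sdk_location might accept a locally built Beam tarball, e.g.

    python wordcount_s3.py --runner DataflowRunner \
        --sdk_location /path/to/apache-beam.tar.gz ...

though the script name and path above are illustrative, and I haven't
verified that this is the intended usage.)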

Thank you.

-- 
Best regards,
Dmitry Demeshchuk.
