Discovery of FileSystem implementations can only happen after your package
is installed, and your class won't be discoverable until the module that
defines it is actually imported somewhere.
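
For context, here's a minimal sketch of why the import matters
(MyS3FileSystem here is just a placeholder, not real Beam code):

  from apache_beam.io.filesystem import FileSystem

  class MyS3FileSystem(FileSystem):
      # A hypothetical custom filesystem. Python only records this
      # subclass in FileSystem.__subclasses__() once the module that
      # defines it has been imported, so a package that is installed
      # but never imported stays invisible to Beam's lookup.

      @classmethod
      def scheme(cls):
          return 's3'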

I agree that we need a better mechanism for registering and discovering new
FileSystems. Would you mind filing a JIRA ticket for that?

- Cham

On Fri, Jul 7, 2017 at 5:25 PM Dmitry Demeshchuk <[email protected]>
wrote:

> Thanks for the clarification, Cham!
>
> Does it mean that for now I'm pretty much stuck with a modified Beam
> distribution, unless/until that filesystem is a part of Beam, or until
> there's a better mechanism to discover new filesystem classes?
>
> Alternatively, maybe it's still possible to change the order of imports
> somehow, so that we initialize the FileSystems class after all the side
> packages are imported? That would be an even better solution, although it
> may not be possible due to circular dependency issues or whatnot.
>
> On Fri, Jul 7, 2017 at 5:14 PM, Chamikara Jayalath <[email protected]>
> wrote:
>
>> You should be able to use a locally modified version of Beam as follows.
>> First build a source distribution:
>>
>> python setup.py sdist
>>
>> Then pass the option '--sdk_location dist/<src dist>' when running your
>> program.
>>
>> Beam currently looks for FileSystem implementations using the sub-class
>> check below:
>>
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L58
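>>
>> Roughly, that check boils down to something like this (a simplified
>> sketch from memory, not the exact code):
>>
>>   # Find a registered FileSystem subclass whose scheme matches the path.
>>   path_scheme = FileSystems.get_scheme(path)
>>   systems = [fs for fs in FileSystem.__subclasses__()
>>              if fs.scheme() == path_scheme]
>>   if len(systems) == 0:
>>       raise ValueError('Unable to get the Filesystem for path %s' % path)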
>>
>> For this to work, you need to force-import your FileSystem module,
>> similar to the following:
>>
>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
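>>
>> i.e. a best-effort import at module load time, along the lines of (the
>> module and class names here are just examples for your package):
>>
>>   # Import solely for the side effect of registering the subclass;
>>   # swallow the error if the package isn't installed.
>>   try:
>>       from my_package.s3filesystem import S3FileSystem
>>   except ImportError:
>>       pass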
>>
>> Thanks,
>> Cham
>>
>> On Fri, Jul 7, 2017 at 5:01 PM Dmitry Demeshchuk <[email protected]>
>> wrote:
>>
>>> Hey folks,
>>>
>>> So, I've been trying to run this custom S3 filesystem of mine on the
>>> Dataflow runner. It works fine on the direct runner and even produces
>>> the correct results. :)
>>>
>>> However, when I run it on Dataflow (it's basically a wordcount example
>>> that has the gs:// paths replaced with s3:// paths), I see the following
>>> errors:
>>>
>>>
>>> (b84963e5767747e2): Traceback (most recent call last):
>>>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 581, in do_work
>>>     work_executor.execute()
>>>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 209, in execute
>>>     self._split_task)
>>>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 217, in _perform_source_split_considering_api_limits
>>>     desired_bundle_size)
>>>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 254, in _perform_source_split
>>>     for split in source.split(desired_bundle_size):
>>>   File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 172, in split
>>>     return self._get_concat_source().split(
>>>   File "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py", line 109, in _f
>>>     return fnc(self, *args, **kwargs)
>>>   File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 119, in _get_concat_source
>>>     match_result = FileSystems.match([pattern])[0]
>>>   File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 130, in match
>>>     filesystem = FileSystems.get_filesystem(patterns[0])
>>>   File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filesystems.py", line 61, in get_filesystem
>>>     raise ValueError('Unable to get the Filesystem for path %s' % path)
>>> ValueError: Unable to get the Filesystem for path s3://pm-dataflow/test
>>>
>>>
>>> To my understanding, this means that the module containing my
>>> S3FileSystem doesn't get imported (it currently resides in a separate
>>> package that I ship to the pipeline). I'm guessing the import order of
>>> modules differs depending on the runner; that's the only explanation I
>>> have for why exactly the same pipeline works on the direct runner and
>>> produces the correct results.
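>>>
>>> (A quick sanity check, assuming the subclass lookup mechanism described
>>> above, would be to print which schemes are actually registered in a
>>> given process, e.g.:
>>>
>>>   from apache_beam.io.filesystem import FileSystem
>>>
>>>   # Each FileSystem subclass whose module has been imported shows up
>>>   # in __subclasses__(); unimported ones are invisible.
>>>   print([fs.scheme() for fs in FileSystem.__subclasses__()])
>>>
>>> but I don't have an easy way to run that on the Dataflow workers.)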
>>>
>>> Any ideas on the best way to troubleshoot this? Can I somehow make Beam
>>> upload a locally modified version of the apache_beam package to
>>> Dataflow? I'd be happy to provide more details if needed. Or should I
>>> instead be reaching out to the Dataflow team?
>>>
>>> Thank you.
>>>
>>>
>>> --
>>> Best regards,
>>> Dmitry Demeshchuk.
>>>
>>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>
