OK folks, I figured it out.

For the other people desperately clutching at years-old Google results in
the hope of finding any hint...

Spark's requirement to work with a fat jar caused a packaging collision on
this file:

META-INF/services/org.apache.hadoop.fs.FileSystem

Both hadoop-common-2.6.5.jar and hadoop-hdfs-2.6.5.jar ship that service
file, so depending on which one got packaged first, the FileSystem
implementations declared by the other were erased from the fat jar.
For the Maven adepts: there are plugins that fix the collision; a sketch
follows below.
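
In case it helps the next person, here is a minimal sketch of such a fix,
assuming the fat jar is built with the maven-shade-plugin: its
ServicesResourceTransformer concatenates the META-INF/services entries from
all jars instead of keeping only whichever copy gets packaged first.

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <transformers>
            <!-- merge META-INF/services/* entries from all jars
                 instead of keeping only the first copy encountered -->
            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          </transformers>
        </configuration>
      </execution>
    </executions>
  </plugin>

With that transformer in place the merged service file lists the FileSystem
implementations from both hadoop-common and hadoop-hdfs, so the hdfs scheme
stays registered. (The snippet is an illustration of the approach, not the
exact configuration from my build.)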

Right now this issue is gone for me.

Thanks!

Matt
---
Matt Casters <[email protected]>
Senior Solution Architect, Kettle Project Founder




On Mon, 28 Jan 2019 at 17:08, Matt Casters <[email protected]> wrote:

> Yeah, for this setup I used flintrock to start up a bunch of nodes with
> Spark and HDFS on AWS. I'm launching the pipeline on the master, all
> possible HDFS libraries I can think of are available, and hdfs dfs commands
> work fine on the master and all the slaves.
> It's a problem of transparency, I think: we can't see what's going on,
> what's required, and so on.
>
> Thanks,
>
> Matt
>
> On Mon, 28 Jan 2019 at 16:14, Juan Carlos Garcia <[email protected]> wrote:
>
>> Matt, is the machine from which you are launching the pipeline different
>> from the one where it should run?
>>
>> If that's the case, make sure the machine used for launching has all the
>> HDFS environment variables set, as the pipeline is configured on the
>> launching machine before it hits the worker machines.
>>
>> Good luck
>> JC
>>
>>
>> On Mon, 28 Jan 2019 at 13:34, Matt Casters <[email protected]>
>> wrote:
>>
>>> Dear Beam friends,
>>>
>>> In preparation for my presentation of the Kettle Beam work in London
>>> next week, I've been trying to get Beam on Spark to run, which worked in
>>> the end.
>>> The problem that resurfaced is, however... once again... back with a
>>> vengeance:
>>>
>>> java.lang.IllegalArgumentException: No filesystem found for scheme hdfs
>>>
>>>
>>> I configured HADOOP_HOME and HADOOP_CONF_DIR, ran
>>> FileSystems.setDefaultPipelineOptions(pipelineOptions), and tried
>>> every trick in the book (very few of those are to be found), but it's a
>>> fairly brutal trial-and-error process.
>>>
>>> Given that I'm not the only person hitting these issues, I think it
>>> would be a good idea to allow for some sort of feedback from the
>>> FileSystems loading process: which filesystems it tries to load, which
>>> fail, and so on.
>>> Also, the maven library situation is a bit fuzzy in the sense that there
>>> are libraries like beam-sdks-java-io-hdfs on a point release (0.6.0) as
>>> well as beam-sdks-java-io-hadoop-file-system on the latest version.
>>>
>>> I've pushed my trial-and-error approach to its limit and am ready to
>>> give up on Beam-on-Spark.  I could try to get a Spark test environment
>>> configured for s3:// but I don't think it's all that representative of
>>> real-world scenarios.
>>>
>>> Thanks anyway in advance for any suggestions,
>>>
>>> Matt
>>> ---
>>> Matt Casters <[email protected]>
>>> Senior Solution Architect, Kettle Project Founder
>>>
>>>
>>>
