OK folks, I figured it out. For the other people desperately clutching at years-old Google results in the hope of finding any hint...
The Spark requirement to work with a fat jar caused a collision in the packaging on the file META-INF/services/org.apache.hadoop.fs.FileSystem. This in turn erased some of the filesystem drivers registered by hadoop-common-2.6.5.jar and hadoop-hdfs-2.6.5.jar, depending on which one got packaged first. For the Maven adepts there are plugins that fix the collision (a sketch of the shade-plugin configuration follows below the quoted thread). Right now this issue is gone for me.

Thanks!

Matt
---
Matt Casters <[email protected]>
Senior Solution Architect, Kettle Project Founder


On Mon, 28 Jan 2019 at 17:08, Matt Casters <[email protected]> wrote:

> Yeah, for this setup I used flintrock to start up a bunch of nodes with
> Spark and HDFS on AWS. I'm launching the pipeline on the master, all
> possible HDFS libraries I can think of are available, and hdfs dfs
> commands work fine on the master and all the slaves.
> It's a problem of transparency, I think, where we can't see what's going
> on, what's required and so on.
>
> Thanks,
>
> Matt
>
> On Mon, 28 Jan 2019 at 16:14, Juan Carlos Garcia <[email protected]> wrote:
>
>> Matt, is the machine from which you are launching the pipeline different
>> from the one where it should run?
>>
>> If that's the case, make sure the machine used for launching has all the
>> HDFS environment variables set, as the pipeline is being configured on
>> the launching machine before it hits the worker machines.
>>
>> Good luck
>> JC
>>
>>
>> On Mon, 28 Jan 2019 at 13:34, Matt Casters <[email protected]> wrote:
>>
>>> Dear Beam friends,
>>>
>>> In preparation for my presentation of the Kettle Beam work in London
>>> next week I've been trying to get Beam on Spark to run, which worked in
>>> the end.
>>> The problem that resurfaced is, however... once again... back with a
>>> vengeance:
>>>
>>> java.lang.IllegalArgumentException: No filesystem found for scheme hdfs
>>>
>>> I configured HADOOP_HOME and HADOOP_CONF_DIR, ran
>>> FileSystems.setDefaultPipelineOptions(pipelineOptions), and tried every
>>> trick in the book (very few of those are to be found), but it's a
>>> fairly brutal trial-and-error process.
>>>
>>> Given that I'm not the only person hitting these issues, I think it
>>> would be a good idea to allow for some sort of feedback from the
>>> FileSystems loading process: which filesystems it tries to load, which
>>> fail, and so on.
>>> Also, the Maven library situation is a bit fuzzy in the sense that
>>> there are libraries like beam-sdks-java-io-hdfs on a point release
>>> (0.6.0) as well as beam-sdks-java-io-hadoop-file-system on the latest
>>> version.
>>>
>>> I've been expanding my trial-and-error pattern to the end point and am
>>> ready to give up on Beam-on-Spark. I could try to get a Spark test
>>> environment configured for s3:// but I don't think it's all that
>>> representative of real-world scenarios.
>>>
>>> Thanks anyway in advance for any suggestions,
>>>
>>> Matt
>>> ---
>>> Matt Casters <[email protected]>
>>> Senior Solution Architect, Kettle Project Founder
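P.S. For anyone else landing here from a search: a minimal sketch of the maven-shade-plugin section that merges the META-INF/services entries instead of letting one jar's copy overwrite the other's. The plugin version and the surrounding pom layout are placeholders, not something from this thread; adapt them to your own build.

  <plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.1</version>
    <executions>
      <execution>
        <phase>package</phase>
        <goals>
          <goal>shade</goal>
        </goals>
        <configuration>
          <transformers>
            <!-- Concatenates META-INF/services files from all jars instead of
                 keeping only the first one, so the FileSystem registrations
                 from hadoop-common and hadoop-hdfs both survive the fat jar. -->
            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
          </transformers>
        </configuration>
      </execution>
    </executions>
  </plugin>

The ServicesResourceTransformer is what does the actual merging; without it the shade plugin keeps whichever service file it packages first, which is exactly the collision described above.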

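P.P.S. On the Beam side, for the "No filesystem found for scheme hdfs" error itself: a minimal Java sketch of handing the Hadoop configuration to Beam through HadoopFileSystemOptions before calling FileSystems.setDefaultPipelineOptions, assuming beam-sdks-java-io-hadoop-file-system is on the classpath. The NameNode address and the class name are placeholders I made up for the example.

  import java.util.Collections;

  import org.apache.beam.sdk.io.FileSystems;
  import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.hadoop.conf.Configuration;

  public class HdfsSchemeSketch {

    public static void main(String[] args) {
      // Hadoop configuration pointing at the NameNode; the address is a placeholder.
      Configuration hadoopConf = new Configuration();
      hadoopConf.set("fs.defaultFS", "hdfs://namenode:8020");

      // Hand the configuration to Beam; the hdfs filesystem registrar reads it
      // from HadoopFileSystemOptions when the filesystems are loaded.
      HadoopFileSystemOptions options =
          PipelineOptionsFactory.as(HadoopFileSystemOptions.class);
      options.setHdfsConfiguration(Collections.singletonList(hadoopConf));

      // Register the filesystems for this JVM; after this, hdfs:// paths should
      // resolve, provided the META-INF/services entries survived the packaging.
      FileSystems.setDefaultPipelineOptions(options);
    }
  }

This only helps once the service file is intact in the fat jar; if the registrations were clobbered during packaging, the shade-plugin fix above is still the thing that matters.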