No problem, happy to help ;)

Regards
JB

On 01/10/2018 11:15 AM, Shashank Prabhakara wrote:
It was the missing hdfs filesystem extension dependency. Thanks Jean-Baptiste. Much appreciated.
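
For anyone else hitting the same "Matched 0 files" symptom on hdfs:// paths, here is a rough sketch of the two checks discussed in this thread. It assumes the extension in question is the org.apache.beam:beam-sdks-java-io-hadoop-file-system artifact, whose registrar class is org.apache.beam.sdk.io.hdfs.HadoopFileSystemRegistrar. The function names are made up for illustration, and the jar path is the one from the reproduction steps.

```shell
#!/bin/sh
# Sketch only: sanity checks for "Matched 0 files" on hdfs:// patterns.
# has_hdfs_registrar and has_hdfs_site are hypothetical helper names.

# Succeeds if the jar listing read from stdin mentions Beam's HDFS
# filesystem registrar, i.e. the HDFS extension made it into the jar.
has_hdfs_registrar() {
  grep -q 'org/apache/beam/sdk/io/hdfs/HadoopFileSystemRegistrar'
}

# Succeeds if the given directory (default: $HADOOP_CONF_DIR) exists
# and contains an hdfs-site.xml.
has_hdfs_site() {
  dir="${1:-${HADOOP_CONF_DIR:-}}"
  [ -n "$dir" ] && [ -f "$dir/hdfs-site.xml" ]
}

# Example usage:
#   unzip -l target/word-count-beam-bundled-0.1.jar | has_hdfs_registrar \
#     || echo "HDFS filesystem extension is not in the bundled jar"
#   has_hdfs_site || echo "HADOOP_CONF_DIR is unset or has no hdfs-site.xml"
```

If the first check fails, adding the dependency to the pom and rebuilding the bundled jar is the likely fix. If the second fails, export HADOOP_CONF_DIR before launching the pipeline.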

Regards,
Shashank

On Wed, Jan 10, 2018 at 2:09 PM, Jean-Baptiste Onofré <[email protected]> wrote:

    Hi

    Do you have the Beam HDFS filesystem extension in the dependencies? Did you
    define the HADOOP_CONF_DIR env variable with the path to the directory
    containing hdfs-site.xml?

    Regards
    JB


    On 01/10/2018 08:55 AM, Shashank Prabhakara wrote:

        Hello,

        I'm testing some pipelines on a Dataproc cluster with Hadoop version
        2.8.2 and Beam 2.3.0-SNAPSHOT.
        I have observed, on our pipeline as well as the WordCount that ships with
        Beam, that FileBasedSource does not "match" any files when using the hdfs
        prefix. I verified this with the Apex runner and the direct runner. Local fs
        and GoogleHadoopFileSystem work fine. HDFS file access has been verified
        from the CLI on all worker nodes, for all users.

        In the logs (console for direct runner, apex.log from one of the
        containers for apex runner):
        INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern hdfs:///tmp/input/

        Tried numerous versions of the same URI. For example:
        INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern hdfs://cluster-m/tmp/input/twitter.avro
        INFO org.apache.beam.sdk.io.FileBasedSource: Matched 0 files for pattern hdfs://mycluster-m/tmp/input/twitter.avro

        Works for GCS files:
        INFO org.apache.beam.sdk.io.FileBasedSource: Matched 1 files for pattern gs://mybucket/input/twitter/twitter.avro



        To reproduce, use the Beam examples archetype, package and execute:

        mvn archetype:generate \
            -DarchetypeRepository=https://repository.apache.org/content/groups/snapshots \
            -DarchetypeGroupId=org.apache.beam \
            -DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
            -DarchetypeVersion=LATEST \
            -DgroupId=org.example \
            -DartifactId=word-count-beam \
            -Dversion="0.1" \
            -Dpackage=org.apache.beam.examples \
            -DinteractiveMode=false

        cd word-count-beam
        mvn clean package -Papex-runner -DskipTests

        yarn jar target/word-count-beam-bundled-0.1.jar \
            org.apache.beam.examples.WordCount \
            --inputFile=hdfs:///tmp/input/pom.xml \
            --output=/tmp/output \
            --runner=ApexRunner \
            --embeddedExecution=false


        Note: "mvn compile exec:java ..." would not work for me due to
        classpath/version-compatibility issues. I also needed to exclude
        org.apache.hadoop:* and com.google.cloud.bigdataoss:* from the shaded jar
        for version compatibility.

        Appreciate any help.

        Regards,
        Shashank


    --
    Jean-Baptiste Onofré
    [email protected]
    http://blog.nanthrax.net
    Talend - http://www.talend.com



