Not sure what you mean, Davor; the only runner I see registering IOs is DataflowRunner, here:
<https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L228>. I'll look into this.
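For anyone following along, here is a rough sketch of the kind of registration I mean. The method name is my reading of IOChannelUtils and may be off, so treat it as an assumption to verify rather than a confirmed recipe; the real catch, per Davor's point below, is that this would have to happen in the JVMs that execute the read (the Spark executors), not only in the driver:

  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.beam.sdk.util.IOChannelUtils;

  public class RegisterFactories {
    public static void main(String[] args) {
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
      // Re-register the built-in factories (file://, gs://, ...) from the
      // options, overriding existing entries instead of failing with the
      // "Scheme: [file] is already registered" error quoted below.
      // NOTE: method name unverified; check IOChannelUtils in your SDK version.
      IOChannelUtils.registerIOFactoriesAllowOverride(options);
    }
  }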
On Tue, Jan 31, 2017 at 11:48 PM Davor Bonaci <da...@apache.org> wrote:
> The file systems are automatically registered locally; that's why you'd be getting an error saying 'already registered'. We need to get it registered on your cluster, in the remote execution environment, where you are getting the error saying 'Unable to find handler for gs://'.
>
> Amit, any idea from the Spark runner side whether IOChannelFactories get staged? Would you expect AutoService to register them automatically, or does something manual need to happen?
>
> On Mon, Jan 30, 2017 at 10:11 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
> > Thank you Davor for the reply!
> > Your understanding of my problem is exactly right.
> >
> > I thought about the issue you mentioned and looked at the Beam source code. It looks to me like IO is done via the IOChannelFactory class, which has two subclasses, FileIOChannelFactory and GcsIOChannelFactory. I figured the wrong class probably got registered. This link I found, http://markmail.org/message/mrv4cg4y6bjtdssy, points out the same registration problem. So I tried registering GcsIOChannelFactory, but got the following error:
> >
> > Scheme: [file] is already registered with class org.apache.beam.sdk.util.FileIOChannelFactory
> >
> > Now I'm not sure what to do. Thanks for the help!
> >
> > Chaoran
> >
> > On Jan 30, 2017, at 12:05 PM, Davor Bonaci <da...@apache.org> wrote:
> > > Sorry for the delay here.
> > >
> > > Am I correct in summarizing that "gs://bucket/file" doesn't work on a Spark cluster, but does with the Spark runner locally? Beam file systems utilize AutoService functionality and, generally speaking, all filesystems should be available and automatically registered on all runners. This is probably just a simple matter of staging the right classes on the cluster.
> > >
> > > Pei, any additional thoughts here?
> > >
> > > On Mon, Jan 23, 2017 at 1:58 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
> > > > Sorry for the spam, but to clarify: I didn't write the code. I'm using the code described here: https://beam.apache.org/get-started/wordcount-example/ So the file already exists in GS.
> > > >
> > > > On Jan 23, 2017, at 4:55 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
> > > > > I didn't upload the file. But since the identical Beam code, when running in Spark local mode, was able to fetch the file and process it, the file does exist. It's just that somehow Spark standalone mode can't find it.
> > > > >
> > > > > On Jan 23, 2017, at 4:50 PM, Amit Sela <amitsel...@gmail.com> wrote:
> > > > > > I think "external" is the key here: your cluster is running all its components on your local machine, so you're good.
> > > > > >
> > > > > > As for GS, it's like Amazon's S3, a sort of cloud-storage service offered by Google (think HDFS in the cloud). You need to upload your file to GS. Have you?
> > > > > >
> > > > > > On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
> > > > > > > Well, my file is not in my local filesystem; it's in GS. This is the line of code that reads the input file:
> > > > > > >
> > > > > > > p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
> > > > > > >
> > > > > > > And this page, https://beam.apache.org/get-started/quickstart/, says the following: "you can't access a local file if you are running the pipeline on an external cluster". I'm indeed trying to run a pipeline on a standalone Spark cluster running on my local machine, so local files are not an option.
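Side note: spelled out, the pipeline being discussed looks roughly like the following with the Spark runner. This is my sketch, not code from the thread; the runner wiring and the output path are placeholders of mine (pre-2.0 Java SDK style, matching the TextIO.Read call quoted above):

  import org.apache.beam.runners.spark.SparkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class ReadShakespeare {
    public static void main(String[] args) {
      PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
      // Alternatively, pass --runner=SparkRunner on the command line.
      options.setRunner(SparkRunner.class);

      Pipeline p = Pipeline.create(options);
      // This read is what needs a handler registered for the "gs" scheme on
      // every JVM that executes it, executors included.
      p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
       .apply(TextIO.Write.to("gs://YOUR_BUCKET/output"));  // placeholder bucket
      p.run();
    }
  }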
> > > > > > > On Jan 23, 2017, at 4:41 PM, Amit Sela <amitsel...@gmail.com> wrote:
> > > > > > > > Why not try file:// instead? It doesn't seem like you're using Google Storage, right? I mean, the input file is on your local FS.
> > > > > > > >
> > > > > > > > On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
> > > > > > > > > No, I'm not using Dataproc. I'm simply running on my local machine: I started a local Spark cluster with sbin/start-master.sh and sbin/start-slave.sh, then submitted my Beam job to that cluster. The gs file is the kinglear.txt from Beam's example code, and it should be public.
> > > > > > > > >
> > > > > > > > > My full stack trace is attached.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Chaoran
> > > > > > > > >
> > > > > > > > > On Jan 23, 2017, at 4:23 PM, Amit Sela <amitsel...@gmail.com> wrote:
> > > > > > > > > > Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the machines hosting the executor processes have access to GS? Could you paste the entire stack trace?
> > > > > > > > > >
> > > > > > > > > > On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
> > > > > > > > > > > Thank you Amit for the reply. I just tried two more runners; below is a summary:
> > > > > > > > > > >
> > > > > > > > > > > DirectRunner: works.
> > > > > > > > > > > FlinkRunner: works in local mode; in cluster mode I got the error "Communication with JobManager failed: lost connection to the JobManager".
> > > > > > > > > > > SparkRunner: works in local mode (mvn exec command) but fails in cluster mode (spark-submit) with the error I pasted in the previous email.
> > > > > > > > > > >
> > > > > > > > > > > In SparkRunner's case, could it be that the Spark executors can't access the gs file in Google Storage?
> > > > > > > > > > >
> > > > > > > > > > > Thank you,
> > > > > > > > > > >
> > > > > > > > > > > On Jan 23, 2017, at 3:28 PM, Amit Sela <amitsel...@gmail.com> wrote:
> > > > > > > > > > > > Is this working for you with other runners? Judging by the stack trace, IOChannelUtils fails to find a handler, so it doesn't seem to be a Spark-specific problem.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
> > > > > > > > > > > > > Thank you Amit and JB!
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is not related to DC/OS itself, but I ran into a problem when launching a Spark job on a cluster with spark-submit. My Spark job written in Beam can't read the specified gs file. I got the following error:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
> > > > > > > > > > > > >   at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
> > > > > > > > > > > > >   at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
> > > > > > > > > > > > >   at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
> > > > > > > > > > > > >   at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then I thought about switching to another source, but I saw in Beam's documentation that TextIO can only read from files in Google Cloud Storage (prefixed with gs://) when running in cluster mode. How do you guys do file IO in Beam when using the SparkRunner?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you,
> > > > > > > > > > > > > Chaoran
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Jan 22, 2017, at 4:32 AM, Amit Sela <amitsel...@gmail.com> wrote:
> > > > > > > > > > > > > > I'll join JB's comment on the Spark runner: submitting Beam pipelines using the Spark runner can be done with Spark's spark-submit script. Find out more in the Spark runner documentation <https://beam.apache.org/documentation/runners/spark/>.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Amit.
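To make that concrete, submission looks roughly like this per the Spark runner documentation linked above (class name, jar name, and master URL are placeholders of mine):

  spark-submit --class com.example.ReadShakespeare \
    --master spark://HOST:7077 \
    target/my-pipeline-1.0-shaded.jar \
    --runner=SparkRunner

Note the shaded (fat) jar: if the GCS filesystem classes are not bundled into what actually ships to the executors, that would be consistent with the "Unable to find handler for gs://" error above.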
> > > > > > > > > > > > > > On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Not directly DC/OS (I think Stephen did some tests on it), but I have a platform running Spark and Flink with Beam on Mesos + Marathon. It basically doesn't need anything special, since running pipelines uses spark-submit (just as in "native" Spark).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > JB
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 01/22/2017 12:56 AM, Chaoran Yu wrote:
> > > > > > > > > > > > > > > > Hello all,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Has anyone had experience using Beam on DC/OS? I want to run Beam code executed with the Spark runner on DC/OS. As a next step, I would like to run the Flink runner as well. There doesn't seem to be any information about running Beam on DC/OS that I can find on the web, so some pointers are greatly appreciated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thank you,
> > > > > > > > > > > > > > > > Chaoran Yu
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Jean-Baptiste Onofré
> > > > > > > > > > > > > > > jbono...@apache.org
> > > > > > > > > > > > > > > http://blog.nanthrax.net
> > > > > > > > > > > > > > > Talend - http://www.talend.com