Sorry for the delay here.

Am I correct in summarizing that "gs://bucket/file" doesn't work on a Spark
cluster, but does with the Spark runner locally? Beam file systems use
AutoService functionality and, generally speaking, all filesystems should
be available and automatically registered on all runners. This is probably
just a simple matter of staging the right classes on the cluster.
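
For reference, a rough sketch of what that staging can look like when building
the jar that gets handed to spark-submit (the artifact id and version property
here are from memory, so please double-check them against your Beam version):
add the GCS extension as a dependency and make sure the shade plugin merges
META-INF/services entries, since that is where the AutoService registrations
live.

    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-sdks-java-extensions-google-cloud-platform-core</artifactId>
      <version>${beam.version}</version>
    </dependency>

    <!-- inside the maven-shade-plugin configuration: merge META-INF/services
         files so ServiceLoader/AutoService registrations survive shading -->
    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>

With both in place, the bundled jar passed to spark-submit should carry the
gs:// handler along with the pipeline code.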

Pei, any additional thoughts here?

On Mon, Jan 23, 2017 at 1:58 PM, Chaoran Yu <chaoran...@lightbend.com>
wrote:

> Sorry for the spam. To clarify, I didn't write the code myself; I'm using the
> code described here: https://beam.apache.org/get-started/wordcount-example/
> So the file already exists in GS.
>
> On Jan 23, 2017, at 4:55 PM, Chaoran Yu <chaoran...@lightbend.com> wrote:
>
> I didn’t upload the file. But since the identical Beam code, when running
> in Spark local mode, was able to fetch the file and process it, the file
> does exist.
> It’s just that somehow Spark standalone mode can’t find the file.
>
>
> On Jan 23, 2017, at 4:50 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
> I think "external" is the key here, you're cluster is running all it's
> components on your local machine so you're good.
>
> As for GS, it's like Amazon's S3 or sort-of a cloud service HDFS offered
> by Google. You need to upload your file to GS. Have you ?
>
> On Mon, Jan 23, 2017 at 11:47 PM Chaoran Yu <chaoran...@lightbend.com>
> wrote:
>
>> Well, my file is not in my local filesystem. It's in GS.
>> This is the line of code that reads the input file:
>> p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))
>>
>> And this page https://beam.apache.org/get-started/quickstart/ says the
>> following:
>> "you can’t access a local file if you are running the pipeline on an
>> external cluster”.
>> I’m indeed trying to run a pipeline on a standalone Spark cluster running
>> on my local machine. So local files are not an option.
>>
>>
>> On Jan 23, 2017, at 4:41 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>
>> Why not try file:// instead? It doesn't seem like you're using Google
>> Storage, right? I mean, the input file is on your local FS.
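>>
>> Something like the following, off the top of my head (the path is just a
>> placeholder for wherever kinglear.txt lives on your machine):
>>
>> p.apply(TextIO.Read.from("file:///path/to/kinglear.txt"))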
>>
>> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <chaoran...@lightbend.com>
>> wrote:
>>
>> No, I'm not using Dataproc.
>> I’m simply running on my local machine. I started a local Spark cluster
>> with sbin/start-master.sh and sbin/start-slave.sh. Then I submitted my Beam
>> job to that cluster.
>> The gs file is the kinglear.txt from Beam’s example code and it should be
>> public.
>>
>> My full stack trace is attached.
>>
>> Thanks,
>> Chaoran
>>
>>
>>
>> On Jan 23, 2017, at 4:23 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>
>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the
>> machines hosting the executor processes have access to GS? Could you paste
>> the entire stack trace?
>>
>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <chaoran...@lightbend.com>
>> wrote:
>>
>> Thank you Amit for the reply,
>>
>> I just tried two more runners and below is a summary:
>>
>> DirectRunner: works
>> FlinkRunner: works in local mode. I got an error “Communication with
>> JobManager failed: lost connection to the JobManager” when running in
>> cluster mode.
>> SparkRunner: works in local mode (mvn exec command) but fails in cluster
>> mode (spark-submit) with the error I pasted in the previous email.
>>
>> In SparkRunner's case, could it be that the Spark executors can't access
>> the gs:// file in Google Storage?
>>
>> Thank you,
>>
>>
>>
>> On Jan 23, 2017, at 3:28 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>
>> Is this working for you with other runners? Judging by the stack trace, it
>> seems like IOChannelUtils fails to find a handler, so it doesn't seem like
>> a Spark-specific problem.
>>
>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <chaoran...@lightbend.com>
>> wrote:
>>
>> Thank you Amit and JB!
>>
>> This is not related to DC/OS itself, but I ran into a problem when
>> launching a Spark job on a cluster with spark-submit. My Spark job written
>> in Beam can’t read the specified gs file. I got the following error:
>>
>> Caused by: java.io.IOException: Unable to find handler for gs://beam-samples/sample.txt
>> at org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>> at org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>> at org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>> at org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>
>> Then I thought about switching to reading from another source, but I saw
>> in Beam’s documentation that TextIO can only read from files in Google
>> Cloud Storage (prefixed with gs://) when running in cluster mode. How do
>> you guys do file IO in Beam when using the SparkRunner?
>>
>>
>> Thank you,
>> Chaoran
>>
>>
>> On Jan 22, 2017, at 4:32 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>
>> I'll echo JB's comment on the Spark runner: submitting Beam pipelines
>> with the Spark runner can be done using Spark's spark-submit script. You
>> can find out more in the Spark runner documentation
>> <https://beam.apache.org/documentation/runners/spark/>.
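>>
>> Roughly along these lines (the class, jar name and master URL are just
>> placeholders for your own setup):
>>
>> spark-submit --class org.apache.beam.examples.WordCount \
>>   --master spark://localhost:7077 \
>>   target/word-count-beam-bundled-0.1.jar \
>>   --runner=SparkRunner --inputFile=pom.xml --output=counts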
>>
>> Amit.
>>
>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>> Hi,
>>
>> Not directly DC/OS (I think Stephen did some tests on it), but I have a
>> platform running Spark and Flink with Beam on Mesos + Marathon.
>>
>> It basically doesn't have anything special: running pipelines uses
>> spark-submit (just as in Spark "natively").
>>
>> Regards
>> JB
>>
>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>> > Hello all,
>> >
>> > Has anyone had experience using Beam on DC/OS? I want to run Beam code
>> > executed with the Spark runner on DC/OS. As a next step, I would like to
>> > run the Flink runner as well. There doesn't seem to be any information
>> > about running Beam on DC/OS that I can find on the web, so some pointers
>> > are greatly appreciated.
>> >
>> > Thank you,
>> >
>> > Chaoran Yu
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
>>
>>
>>
>
>
