Well, my file is not in my local filesystem. It’s in GS. 
This is the line of code that reads the input file: 
p.apply(TextIO.Read.from("gs://apache-beam-samples/shakespeare/*"))

And this page https://beam.apache.org/get-started/quickstart/ says the 
following: “you can’t access a local file if you are running the pipeline on an 
external cluster”.
I’m indeed trying to run a pipeline on a standalone Spark cluster running on my 
local machine. So local files are not an option.
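
My guess is that the executors never register a handler for the gs:// scheme, 
which is what the “Unable to find handler” message in the stack trace hints at. 
Here is a simplified sketch of that kind of scheme-to-factory lookup 
(illustrative only; the class and method names below are made up and this is 
not Beam’s actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a scheme-based channel-factory registry.
// In Beam, IOChannelUtils resolves a factory for a spec's scheme; if the
// GCS pieces never registered themselves on the worker, the lookup fails.
class IOChannelRegistry {
    private static final Map<String, String> FACTORIES = new HashMap<>();

    static {
        // The local-file handler is always present...
        FACTORIES.put("file", "FileIOChannelFactory");
        // ...but a "gs" handler only exists if the GCS support was
        // registered on this JVM. On a Spark executor whose jar lacks it,
        // the equivalent of this line never runs:
        // FACTORIES.put("gs", "GcsIOChannelFactory");
    }

    static String getFactory(String spec) {
        // Extract the scheme, mirroring how gs://bucket/object is parsed.
        int sep = spec.indexOf("://");
        String scheme = (sep < 0) ? "file" : spec.substring(0, sep);
        String factory = FACTORIES.get(scheme);
        if (factory == null) {
            // The failure mode from the stack trace below.
            throw new RuntimeException("Unable to find handler for " + spec);
        }
        return factory;
    }
}

public class Main {
    public static void main(String[] args) {
        System.out.println(IOChannelRegistry.getFactory("file:///tmp/kinglear.txt"));
        try {
            IOChannelRegistry.getFactory("gs://beam-samples/sample.txt");
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If that is what’s happening, the thing to check would be whether the fat jar 
passed to spark-submit actually bundles the GCS parts of the Beam SDK, so the 
handler gets registered on the worker JVMs as well as the driver.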


> On Jan 23, 2017, at 4:41 PM, Amit Sela <amitsel...@gmail.com> wrote:
> 
> Why not try file:// instead? It doesn't seem like you're using Google 
> Storage, right? I mean, the input file is on your local FS.
> 
> On Mon, Jan 23, 2017 at 11:34 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
> No, I’m not using Dataproc.
> I’m simply running on my local machine. I started a local Spark cluster with 
> sbin/start-master.sh and sbin/start-slave.sh. Then I submitted my Beam job to 
> that cluster.
> The gs:// file is kinglear.txt from Beam’s example code, and it should be 
> public.
> 
> My full stack trace is attached.
> 
> Thanks,
> Chaoran
> 
> 
> 
>> On Jan 23, 2017, at 4:23 PM, Amit Sela <amitsel...@gmail.com> wrote:
>> 
>> Maybe. Are you running on Dataproc? Are you using YARN/Mesos? Do the 
>> machines hosting the executor processes have access to GS? Could you paste 
>> the entire stack trace?
>> 
>> On Mon, Jan 23, 2017 at 11:21 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>> Thank you Amit for the reply,
>> 
>> I just tried two more runners and below is a summary:
>> 
>> DirectRunner: works.
>> FlinkRunner: works in local mode. In cluster mode, it fails with the error 
>> “Communication with JobManager failed: lost connection to the JobManager”.
>> SparkRunner: works in local mode (the mvn exec command) but fails in cluster 
>> mode (spark-submit) with the error I pasted in the previous email.
>> 
>> In SparkRunner’s case, can it be that the Spark executors can’t access the 
>> gs:// file in Google Storage?
>> 
>> Thank you,
>> 
>> 
>> 
>>> On Jan 23, 2017, at 3:28 PM, Amit Sela <amitsel...@gmail.com> wrote:
>>> 
>>> Is this working for you with other runners? Judging by the stack trace, 
>>> IOChannelUtils fails to find a handler, so it doesn't seem to be a 
>>> Spark-specific problem.
>>> 
>>> On Mon, Jan 23, 2017 at 8:50 PM Chaoran Yu <chaoran...@lightbend.com> wrote:
>>> Thank you Amit and JB! 
>>> 
>>> This is not related to DC/OS itself, but I ran into a problem when 
>>> launching a Spark job on a cluster with spark-submit. My Spark job written 
>>> in Beam can’t read the specified gs file. I got the following error:
>>> 
>>> Caused by: java.io.IOException: Unable to find handler for 
>>> gs://beam-samples/sample.txt
>>>     at 
>>> org.apache.beam.sdk.util.IOChannelUtils.getFactory(IOChannelUtils.java:307)
>>>     at 
>>> org.apache.beam.sdk.io.FileBasedSource$FileBasedReader.startImpl(FileBasedSource.java:528)
>>>     at 
>>> org.apache.beam.sdk.io.OffsetBasedSource$OffsetBasedReader.start(OffsetBasedSource.java:271)
>>>     at 
>>> org.apache.beam.runners.spark.io.SourceRDD$Bounded$1.hasNext(SourceRDD.java:125)
>>> 
>>> Then I thought about switching to reading from another source, but I saw in 
>>> Beam’s documentation that TextIO can only read from files in Google Cloud 
>>> Storage (prefixed with gs://) when running in cluster mode. How do you guys 
>>> do file IO in Beam when using the SparkRunner?
>>> 
>>> 
>>> Thank you,
>>> Chaoran
>>> 
>>> 
>>>> On Jan 22, 2017, at 4:32 AM, Amit Sela <amitsel...@gmail.com> wrote:
>>>> 
>>>> I'll join JB's comment on the Spark runner: submitting Beam pipelines 
>>>> using the Spark runner can be done with Spark's spark-submit script. Find 
>>>> out more in the Spark runner documentation: 
>>>> https://beam.apache.org/documentation/runners/spark/
>>>> 
>>>> Amit.
>>>> 
>>>> On Sun, Jan 22, 2017 at 8:03 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> Hi,
>>>> 
>>>> Not directly DC/OS (I think Stephen did some tests on it), but I have a
>>>> platform running Spark and Flink with Beam on Mesos + Marathon.
>>>> 
>>>> It basically doesn't require anything special, as running pipelines uses
>>>> spark-submit (as in Spark "natively").
>>>> 
>>>> Regards
>>>> JB
>>>> 
>>>> On 01/22/2017 12:56 AM, Chaoran Yu wrote:
>>>> > Hello all,
>>>> >
>>>> > Has anyone had experience using Beam on DC/OS? I want to run Beam code
>>>> > executed with the Spark runner on DC/OS. As a next step, I would like to
>>>> > run the Flink runner as well. There doesn't seem to exist any information
>>>> > about running Beam on DC/OS that I can find on the web. So some pointers
>>>> > are greatly appreciated.
>>>> >
>>>> > Thank you,
>>>> >
>>>> > Chaoran Yu
>>>> 
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbono...@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>> 
>> 
> 
