Re: Beam support on Spark 2.x

2017-11-13 Thread Artur Mrozowski
Ok, thank you.

On Mon, Nov 13, 2017 at 7:20 AM, Jean-Baptiste Onofré 
wrote:

> Hi Artur,
>
> I was talking about my "own" beam-samples that I'm using for the tests:
>
> https://github.com/jbonofre/beam-samples
>
> It's already possible to run on Spark 1.6 using Spark runner provided in
> Beam 2.1.0.
>
> For Spark 2.0, you will have to wait for the Spark runner that will be provided in
> Beam 2.3.0.
>
> Regards
> JB
>
> On 11/13/2017 06:50 AM, Artur Mrozowski wrote:
>
>> Hi Jean-Baptiste,
>> that's great news. When you mention beam-samples, are you referring to the
>> gaming examples? https://github.com/eljefe6a/beamexample
>>
>> Those examples cover a lot of what we try to achieve in our PoC, so it's
>> just great. Would it be possible to run these on both the 1.6 and 2.0 versions
>> of Spark?
>>
>> We prefer 2.0 version of Spark.
>>
>> Best Regards
>> Artur
>>
>> On Fri, Nov 10, 2017 at 1:54 PM, Jean-Baptiste Onofré wrote:
>>
>> Hi,
>>
>> I guess you are not following the dev mailing list.
>>
>> The Spark runner supports almost all transforms and yes, you can fully use
>> the Spark runner to run your pipelines.
>>
>> PCollection is represented as an RDD, and it's currently Spark 1.x.
>>
>> I'm working on the Spark 2.x support (still using RDD): we have a vote in
>> progress on the mailing list on whether we want to support both Spark 1.x &
>> Spark 2.x or just upgrade to Spark 2.x and drop support for Spark 1.x.
>>
>> You can take a look at the beam-samples: they all run using the Spark
>> runner.
>>
>> Regards
>> JB
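
For reference, a minimal sketch of what running a pipeline on the Spark runner looks like in Java around Beam 2.1.0. The main class, master URL and file paths below are placeholders, not something taken from this thread.

  import org.apache.beam.runners.spark.SparkPipelineOptions;
  import org.apache.beam.runners.spark.SparkRunner;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.TextIO;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class SparkRunnerSketch {
    public static void main(String[] args) {
      SparkPipelineOptions options =
          PipelineOptionsFactory.fromArgs(args).withValidation().as(SparkPipelineOptions.class);
      options.setRunner(SparkRunner.class);
      // Placeholder master; on a real cluster this normally comes from spark-submit.
      options.setSparkMaster("local[2]");

      Pipeline p = Pipeline.create(options);
      // Placeholder transforms: read a text file and write it back out.
      p.apply(TextIO.read().from("/tmp/input.txt"))
       .apply(TextIO.write().to("/tmp/output"));
      p.run().waitUntilFinish();
    }
  }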
>>
>>
>> On 11/10/2017 01:46 PM, Artur Mrozowski wrote:
>>
>> Hi,
>> I have seen the compatibility matrix and I realize that Spark is
>> not the
>> most supported runner.
>> I am curious if it is possible to run a pipeline on Spark, say with
>> global windows, after-processing-time triggers and group by key
>> (CoGroupByKey, CombineByKey). We definitely have problems executing a
>> pipeline that successfully runs on the direct runner.
>>
>> Is that a known issue? Is Flink the best option?
>>
>> Best Regards
>> Artur
>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Jean-Baptiste Onofré

Hi Tim,

Basically, if a user still wants to use Spark 1.x, they would just be "stuck"
with Beam 2.2.0.


I would like to see a Beam 2.3.0 at the end of December/beginning of January with
Spark 2.x support (exclusively, or alongside 1.x).


The goal of the discussion is just to know whether it's worth maintaining both Spark
1.x and 2.x, or if I can do a cleanup to support only 2.x ;)


Regards
JB

On 11/13/2017 10:56 AM, Tim Robertson wrote:

Thanks JB

On "thoughts":

- Cloudera 5.13 will still default to 1.6 even though a 2.2 parcel is available 
(HWX provides both)
- Cloudera support for spark 2 has a list of exceptions 
(https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html)

   - I am not sure if the HBaseIO would be affected
   - I am not sure if structured streaming would have implications
   - it might stop customers from being able to run spark 2 at all due to 
support agreements

- Spark 2.3 (EOY) will drop Scala 2.10 support
- IBM's now defunct distro only has 1.6
- Oozie doesn't have a spark 2 action (need to use a shell action)
- There are a lot of folks with code running on 1.3,1.4 and 1.5 still
- Spark 2.2+ requires Java 8 too, while <2.2 was J7 like Beam (not sure if this 
has other implications for the cross deployment nature of Beam)


My first impressions of Beam were really favourable, as it all just worked first
time on a CDH Spark 1.6 cluster. For us it is a lack of resources to refactor
legacy code which delays the 2.2 push.


With that said, I think it is very reasonable to have a clear cut-off in Beam,
especially if it limits progress / causes headaches in packaging, robustness
etc. I'd recommend putting it on a 6-month timeframe, which might align with 2.3?


Hope this helps,
Tim

On Mon, Nov 13, 2017 at 10:07 AM, Neville Dipale wrote:


Hi JB,


   [X ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of
both Spark 1.x and 2.x (please provide specific comment)

On 13 November 2017 at 10:32, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the
user mailing list.
The goal is to have your feedback as users.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three
artifacts (common, spark1, spark2). You, as users, pick spark1 or
spark2 in your dependency set depending on the Spark version you target.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0.
If you still want to use Spark 1.x, then you will be stuck at Beam
2.2.0.

Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré <j...@nanthrax.net>
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark 
runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808


Today, we have something working with both Spark 1.x and 2.x from a code
standpoint, but I have to deal with dependencies. It's the first step of
the update as I'm still using RDD, the second step would be to support
dataframe (but for that, I would need PCollection elements with schemas,
that's another topic on which Eugene, Reuven and I are discussing).

However, as all major distributions now ship Spark 2.x, I don't think
it's required anymore to support Spark 1.x.

If we agree, I will update and cleanup the PR to only support and focus
on Spark 2.x.

So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having
support of both Spark 1.x and 2.x (please provide specific comment)

This vote is open for 48 hours (I have the commits ready, just waiting for
the end of the vote to push on the PR).

Thanks !
Regards
JB
-- 
Jean-Baptiste Onofré

jbono...@apache.org 
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache Beam and Spark runner

2017-11-13 Thread Jean-Baptiste Onofré

See my answers on the dev mailing list.

NB: no need to "flood" both mailing lists ;)

Regards
JB

On 11/13/2017 10:56 AM, Nishu wrote:

Hi,

I am writing a streaming pipeline in Apache Beam using the Spark runner.
Use case: to join multiple Kafka streams using windowed collections. I use
GroupByKey to group the events by a common business key, and that output is
used as input for the join operation. The pipeline runs on the direct runner as
expected, but on a Spark cluster (v2.1) it throws an Accumulator error:
"Exception in thread "main" java.lang.AssertionError: assertion failed:
copyAndReset must return a zero value copy"

I tried the same pipeline on a Spark cluster (v1.6); there it runs without any
error, but it doesn't perform the join operations on the streams.


I have a couple of questions.

1. Does the Spark runner support Spark version 2.x?

2. Regarding triggers, currently only ProcessingTimeTrigger is supported in the
Capability Matrix; can we expect support for more triggers in the near future?
Also, are the GroupByKey and accumulating-panes features supported on Spark for
streaming pipelines?


3. According to the documentation, the storage level is set to IN_MEMORY for
streaming pipelines. Can we configure it to use disk as well?


4. Is checkpointing supported for the Spark runner? In case a Beam pipeline
fails unexpectedly, can we read the state from the last run?


It would be great if someone could help with the above.

--
Thanks & Regards,
Nishu Tayal


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Tim Robertson
Thanks JB

On "thoughts":

- Cloudera 5.13 will still default to 1.6 even though a 2.2 parcel is
available (HWX provides both)
- Cloudera support for spark 2 has a list of exceptions (
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html
)
  - I am not sure if the HBaseIO would be affected
  - I am not sure if structured streaming would have implications
  - it might stop customers from being able to run spark 2 at all due to
support agreements
- Spark 2.3 (EOY) will drop Scala 2.10 support
- IBM's now defunct distro only has 1.6
- Oozie doesn't have a spark 2 action (need to use a shell action)
- There are a lot of folks with code running on 1.3,1.4 and 1.5 still
- Spark 2.2+ requires Java 8 too, while <2.2 was J7 like Beam (not sure if
this has other implications for the cross deployment nature of Beam)

My first impressions of Beam were really favourable, as it all just worked
first time on a CDH Spark 1.6 cluster. For us it is a lack of resources to
refactor legacy code which delays the 2.2 push.

With that said, I think it is very reasonable to have a clear cut-off in
Beam, especially if it limits progress / causes headaches in packaging,
robustness etc. I'd recommend putting it on a 6-month timeframe, which
might align with 2.3?

Hope this helps,
Tim

On Mon, Nov 13, 2017 at 10:07 AM, Neville Dipale 
wrote:

> Hi JB,
>
>
>   [X ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>   [ ] 0 (I don't care ;))
>   [ ] -1, I would like to still support Spark 1.x, and so having support
> of both Spark 1.x and 2.x (please provide specific comment)
>
> On 13 November 2017 at 10:32, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Beamers,
>>
>> I'm forwarding this discussion & vote from the dev mailing list to the
>> user mailing list.
>> The goal is to have your feedback as users.
>>
>> Basically, we have two options:
>> 1. Right now, in the PR, we support both Spark 1.x and 2.x using three
>> artifacts (common, spark1, spark2). You, as users, pick spark1 or spark2
>> in your dependency set depending on the Spark version you target.
>> 2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0.
>> If you still want to use Spark 1.x, then you will be stuck at Beam
>> 2.2.0.
>>
>> Thoughts ?
>>
>> Thanks !
>> Regards
>> JB
>>
>>
>>  Forwarded Message 
>> Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
>> Date: Wed, 8 Nov 2017 08:27:58 +0100
>> From: Jean-Baptiste Onofré 
>> Reply-To: d...@beam.apache.org
>> To: d...@beam.apache.org
>>
>> Hi all,
>>
>> as you might know, we are working on Spark 2.x support in the Spark
>> runner.
>>
>> I'm working on a PR about that:
>>
>> https://github.com/apache/beam/pull/3808
>>
>> Today, we have something working with both Spark 1.x and 2.x from a code
>> standpoint, but I have to deal with dependencies. It's the first step of
>> the update as I'm still using RDD, the second step would be to support
>> dataframe (but for that, I would need PCollection elements with schemas,
>> that's another topic on which Eugene, Reuven and I are discussing).
>>
>> However, as all major distributions now ship Spark 2.x, I don't think
>> it's required anymore to support Spark 1.x.
>>
>> If we agree, I will update and cleanup the PR to only support and focus
>> on Spark 2.x.
>>
>> So, that's why I'm calling for a vote:
>>
>>   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>   [ ] 0 (I don't care ;))
>>   [ ] -1, I would like to still support Spark 1.x, and so having support
>> of both Spark 1.x and 2.x (please provide specific comment)
>>
>> This vote is open for 48 hours (I have the commits ready, just waiting for
>> the end of the vote to push on the PR).
>>
>> Thanks !
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Apache Beam and Spark runner

2017-11-13 Thread Nishu
Hi,

I am writing a streaming pipeline in Apache Beam using the Spark runner.
Use case: to join multiple Kafka streams using windowed collections.
I use GroupByKey to group the events by a common business key, and that
output is used as input for the join operation. The pipeline runs on the direct
runner as expected, but on a Spark cluster (v2.1) it throws an Accumulator error:
"Exception in thread "main" java.lang.AssertionError: assertion failed:
copyAndReset must return a zero value copy"

I tried the same pipeline on a Spark cluster (v1.6); there it runs without any
error, but it doesn't perform the join operations on the streams.
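
For illustration, a minimal sketch of the kind of windowed join described above, assuming the two Kafka-backed streams have already been read and keyed by the common business key (the names, value types and window size are assumptions, not the actual pipeline):

  import org.apache.beam.sdk.transforms.join.CoGbkResult;
  import org.apache.beam.sdk.transforms.join.CoGroupByKey;
  import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
  import org.apache.beam.sdk.transforms.windowing.FixedWindows;
  import org.apache.beam.sdk.transforms.windowing.Window;
  import org.apache.beam.sdk.values.KV;
  import org.apache.beam.sdk.values.PCollection;
  import org.apache.beam.sdk.values.TupleTag;
  import org.joda.time.Duration;

  static PCollection<KV<String, CoGbkResult>> joinStreams(
      PCollection<KV<String, String>> orders,
      PCollection<KV<String, String>> payments) {
    // Both inputs must be windowed the same way so the join aligns per window.
    PCollection<KV<String, String>> windowedOrders = orders.apply("WindowOrders",
        Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))));
    PCollection<KV<String, String>> windowedPayments = payments.apply("WindowPayments",
        Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1))));

    TupleTag<String> orderTag = new TupleTag<String>();
    TupleTag<String> paymentTag = new TupleTag<String>();

    // CoGroupByKey groups both streams by key within each window.
    return KeyedPCollectionTuple.of(orderTag, windowedOrders)
        .and(paymentTag, windowedPayments)
        .apply(CoGroupByKey.<String>create());
  }

Each CoGbkResult in the output then exposes the matching values from both sides via getAll(orderTag) / getAll(paymentTag).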

I have a couple of questions.

1. Does the Spark runner support Spark version 2.x?

2. Regarding triggers, currently only ProcessingTimeTrigger is supported in
the Capability Matrix; can we expect support for more triggers in the near
future? Also, are the GroupByKey and accumulating-panes features supported on
Spark for streaming pipelines?

3. According to the documentation, the storage level is set to IN_MEMORY for
streaming pipelines. Can we configure it to use disk as well?

4. Is checkpointing supported for the Spark runner? In case a Beam pipeline
fails unexpectedly, can we read the state from the last run?
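
On questions 3 and 4, a small sketch of the knobs the Spark runner exposes through SparkPipelineOptions, assuming the option names below are current (the values are placeholders):

  import org.apache.beam.runners.spark.SparkPipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class SparkStreamingOptionsSketch {
    public static SparkPipelineOptions configure() {
      SparkPipelineOptions options = PipelineOptionsFactory.as(SparkPipelineOptions.class);
      // Storage level for persisted datasets; the default is memory-only, and a
      // value like "MEMORY_AND_DISK" is assumed to map onto Spark's StorageLevel names.
      options.setStorageLevel("MEMORY_AND_DISK");
      // Directory for the runner's streaming checkpoints; the path is a placeholder,
      // a durable location such as HDFS is the usual choice.
      options.setCheckpointDir("hdfs:///beam/checkpoints");
      return options;
    }
  }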

It would be great if someone could help with the above.

-- 
Thanks & Regards,
Nishu Tayal


Re: PubsubIO unable to set topic and subscription

2017-11-13 Thread Vicky Fazlurrahman
I already have a subscription for a Pub/Sub topic; I just want to read from
the created subscription.
I see, my mistake. It's because I set both fromTopic and fromSubscription at
the same time.
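
For the record, the corrected read keeps only the subscription, i.e. something like this (p and the subscription path being the ones from the snippet quoted below):

  p.apply(PubsubIO.readStrings()
      .fromSubscription("projects/my-gcp-project/subscriptions/subscibe-test"))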

Thank you very much Eugene.

Regards,
Vicky

On Mon, Nov 13, 2017 at 3:58 PM, Eugene Kirpichov 
wrote:

> The error says "Can't set both the topic and the subscription": PubSub
> subscribers read from a subscription, and messages sent to a topic are sent
> to all subscriptions bound to this topic. That's why, when reading from
> PubSub in Beam you can specify either a topic (then a new subscription to
> this topic will be created specifically for this pipeline) or a
> subscription (if you already have a subscription you want to use for this
> pipeline). What are you trying to accomplish by specifying both at the same
> time?
>
> On Mon, Nov 13, 2017, 12:08 AM vicky fazlurrahman <
> vicky.fazlurrah...@gmail.com> wrote:
>
>>
>> Hi all,
>>
>> I am using Beam SDK 2.1.0, trying to run a streaming pipeline using the
>> Dataflow runner that consumes a Pub/Sub stream with the following config:
>>
>> p.apply(PubsubIO.readStrings()
>>     .fromTopic("projects/my-gcp-project/topics/publisher-test")
>>     .fromSubscription("projects/my-gcp-project/subscriptions/subscibe-test"))
>>
>> The build fails because it is unable to set both the topic and the subscription:
>>
>> [DEBUG] Setting accessibility to true in order to invoke main().
>>
>> Nov 13, 2017 2:23:03 PM org.apache.beam.runners.dataflow.DataflowRunner
>> fromOptions
>> INFO: PipelineOptions.filesToStage was not specified. Defaulting to files
>> from the classpath: will stage 111 files. Enable logging at DEBUG level to
>> see which files will be staged.
>> [WARNING]
>> java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(
>> NativeMethodAccessorImpl.java:62)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at org.codehaus.mojo.exec.ExecJavaMojo$1.run(
>> ExecJavaMojo.java:293)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.IllegalStateException: Can't set both the topic and
>> the subscription for a PubsubIO.Read transform
>> at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.expand(
>> PubsubIO.java:702)
>> at org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.expand(
>> PubsubIO.java:536)
>> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
>> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:454)
>> at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
>> at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:165)
>> at com.example.drain.DrainPubsub.main(DrainPubsub.java:67)
>> ... 6 more
>> [INFO] 
>> 
>> [INFO] BUILD FAILURE
>> [INFO] 
>> 
>>
>> I'm a bit confused about the cause of the exception and what the proper
>> config for PubsubIO is, because it seems it can't take the topic and
>> subscription even though they are given in the correct format:
>>
>> projects/<project>/topics/<topic>
>>
>> projects/<project>/subscriptions/<subscription>
>>
>> Any help would be appreciated.
>>
>> --
>> Vicky
>>
>


-- 
Best Regards,

Vicky Fazlurrahman

GO-JEK INDONESIA
E-mail: vi...@go-jek.com
Website: www.go-jek.com
Phone: +62 85729516006


Re: [DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Neville Dipale
Hi JB,


  [X ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
  [ ] 0 (I don't care ;))
  [ ] -1, I would like to still support Spark 1.x, and so having support of
both Spark 1.x and 2.x (please provide specific comment)

On 13 November 2017 at 10:32, Jean-Baptiste Onofré  wrote:

> Hi Beamers,
>
> I'm forwarding this discussion & vote from the dev mailing list to the
> user mailing list.
> The goal is to have your feedback as users.
>
> Basically, we have two options:
> 1. Right now, in the PR, we support both Spark 1.x and 2.x using three
> artifacts (common, spark1, spark2). You, as users, pick spark1 or spark2
> in your dependency set depending on the Spark version you target.
> 2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If
> you still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.
>
> Thoughts ?
>
> Thanks !
> Regards
> JB
>
>
>  Forwarded Message 
> Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
> Date: Wed, 8 Nov 2017 08:27:58 +0100
> From: Jean-Baptiste Onofré 
> Reply-To: d...@beam.apache.org
> To: d...@beam.apache.org
>
> Hi all,
>
> as you might know, we are working on Spark 2.x support in the Spark runner.
>
> I'm working on a PR about that:
>
> https://github.com/apache/beam/pull/3808
>
> Today, we have something working with both Spark 1.x and 2.x from a code
> standpoint, but I have to deal with dependencies. It's the first step of
> the update as I'm still using RDD, the second step would be to support
> dataframe (but for that, I would need PCollection elements with schemas,
> that's another topic on which Eugene, Reuven and I are discussing).
>
> However, as all major distributions now ship Spark 2.x, I don't think it's
> required anymore to support Spark 1.x.
>
> If we agree, I will update and cleanup the PR to only support and focus on
> Spark 2.x.
>
> So, that's why I'm calling for a vote:
>
>   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>   [ ] 0 (I don't care ;))
>   [ ] -1, I would like to still support Spark 1.x, and so having support
> of both Spark 1.x and 2.x (please provide specific comment)
>
> This vote is open for 48 hours (I have the commits ready, just waiting for the
> end of the vote to push on the PR).
>
> Thanks !
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: PubsubIO unable to set topic and subscription

2017-11-13 Thread Eugene Kirpichov
The error says "Can't set both the topic and the subscription": PubSub
subscribers read from a subscription, and messages sent to a topic are sent
to all subscriptions bound to this topic. That's why, when reading from
PubSub in Beam you can specify either a topic (then a new subscription to
this topic will be created specifically for this pipeline) or a
subscription (if you already have a subscription you want to use for this
pipeline). What are you trying to accomplish by specifying both at the same
time?
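
To make the two alternatives concrete, a small sketch using the placeholder names from this thread (p is the Pipeline from the quoted snippet below; use one form or the other, never both on the same Read):

  // Either: read via a topic, letting a pipeline-specific subscription be created.
  p.apply(PubsubIO.readStrings()
      .fromTopic("projects/my-gcp-project/topics/publisher-test"));

  // Or: read from an existing subscription.
  p.apply(PubsubIO.readStrings()
      .fromSubscription("projects/my-gcp-project/subscriptions/subscibe-test"));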

On Mon, Nov 13, 2017, 12:08 AM vicky fazlurrahman <
vicky.fazlurrah...@gmail.com> wrote:

>
> Hi all,
>
> I am using Beam SDK 2.1.0, trying to run a streaming pipeline using the
> Dataflow runner that consumes a Pub/Sub stream with the following config:
>
> p.apply(PubsubIO.readStrings()
> .fromTopic("projects/my-gcp-project/topics/publisher-test")
>
> .fromSubscription("projects/my-gcp-project/subscriptions/subscibe-test"))
>
> The build fails because it is unable to set both the topic and the subscription:
>
> [DEBUG] Setting accessibility to true in order to invoke main().
>
> Nov 13, 2017 2:23:03 PM org.apache.beam.runners.dataflow.DataflowRunner
> fromOptions
> INFO: PipelineOptions.filesToStage was not specified. Defaulting to files
> from the classpath: will stage 111 files. Enable logging at DEBUG level to
> see which files will be staged.
> [WARNING]
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalStateException: Can't set both the topic and
> the subscription for a PubsubIO.Read transform
> at
> org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.expand(PubsubIO.java:702)
> at
> org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.expand(PubsubIO.java:536)
> at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
> at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:454)
> at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
> at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:165)
> at com.example.drain.DrainPubsub.main(DrainPubsub.java:67)
> ... 6 more
> [INFO]
> 
> [INFO] BUILD FAILURE
> [INFO]
> 
>
> I'm a bit confused about the cause of the exception and what the proper
> config for PubsubIO is, because it seems it can't take the topic and
> subscription even though they are given in the correct format:
>
> projects/<project>/topics/<topic>
>
> projects/<project>/subscriptions/<subscription>
>
> Any help would be appreciated.
>
> --
> Vicky
>


[DISCUSS] Drop Spark 1.x support to focus on Spark 2.x

2017-11-13 Thread Jean-Baptiste Onofré

Hi Beamers,

I'm forwarding this discussion & vote from the dev mailing list to the user 
mailing list.

The goal is to have your feedback as users.

Basically, we have two options:
1. Right now, in the PR, we support both Spark 1.x and 2.x using three artifacts
(common, spark1, spark2). You, as users, pick spark1 or spark2 in your
dependency set depending on the Spark version you target.
2. The other option is to upgrade and focus on Spark 2.x in Beam 2.3.0. If you
still want to use Spark 1.x, then you will be stuck at Beam 2.2.0.


Thoughts ?

Thanks !
Regards
JB


 Forwarded Message 
Subject: [VOTE] Drop Spark 1.x support to focus on Spark 2.x
Date: Wed, 8 Nov 2017 08:27:58 +0100
From: Jean-Baptiste Onofré 
Reply-To: d...@beam.apache.org
To: d...@beam.apache.org

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I have to deal with dependencies. It's the first step of the 
update as I'm still using RDD, the second step would be to support dataframe 
(but for that, I would need PCollection elements with schemas, that's another 
topic on which Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and cleanup the PR to only support and focus on Spark 
2.x.


So, that's why I'm calling for a vote:

  [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
  [ ] 0 (I don't care ;))
  [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting for the end
of the vote to push on the PR).


Thanks !
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


PubsubIO unable to set topic and subscription

2017-11-13 Thread vicky fazlurrahman
Hi all,

I am using Beam SDK 2.1.0, trying to run a streaming pipeline using the
Dataflow runner that consumes a Pub/Sub stream with the following config:

p.apply(PubsubIO.readStrings()
    .fromTopic("projects/my-gcp-project/topics/publisher-test")
    .fromSubscription("projects/my-gcp-project/subscriptions/subscibe-test"))

The build fails because it is unable to set both the topic and the subscription:

[DEBUG] Setting accessibility to true in order to invoke main().

Nov 13, 2017 2:23:03 PM org.apache.beam.runners.dataflow.DataflowRunner
fromOptions
INFO: PipelineOptions.filesToStage was not specified. Defaulting to files
from the classpath: will stage 111 files. Enable logging at DEBUG level to
see which files will be staged.
[WARNING]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Can't set both the topic and
the subscription for a PubsubIO.Read transform
at
org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.expand(PubsubIO.java:702)
at
org.apache.beam.sdk.io.gcp.pubsub.PubsubIO$Read.expand(PubsubIO.java:536)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:514)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:454)
at org.apache.beam.sdk.values.PBegin.apply(PBegin.java:44)
at org.apache.beam.sdk.Pipeline.apply(Pipeline.java:165)
at com.example.drain.DrainPubsub.main(DrainPubsub.java:67)
... 6 more
[INFO]

[INFO] BUILD FAILURE
[INFO]


I'm a bit confused about the cause of the exception and what the proper
config for PubsubIO is, because it seems it can't take the topic and
subscription even though they are given in the correct format:

projects/<project>/topics/<topic>

projects/<project>/subscriptions/<subscription>

Any help would be appreciated.

-- 
Vicky