Re: spark 2.0 readStream from a REST API

2016-08-11 Thread Sela, Amit
The currently available output modes are Complete and Append. Complete mode is 
for stateful processing (aggregations), and Append mode is for stateless 
processing (i.e., map/filter). See: 
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Dataset#writeStream produces a DataStreamWriter, which lets you start a 
query. This seems consistent with Spark’s previous behaviour of only executing 
upon an “action”, and the queries, I guess, are what “jobs” used to be.
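
For illustration, here is a minimal sketch of a stateless pipeline that uses Append mode; the socket source, host/port and the console sink are placeholder choices borrowed from the programming guide, not part of the original question:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("AppendModeSketch").getOrCreate()
import spark.implicits._

// Stateless processing (map/filter only) pairs with Append output mode.
val lines = spark.readStream
  .format("socket")               // test source from the programming guide
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val shouting = lines.as[String].filter(_.nonEmpty).map(_.toUpperCase)

// writeStream returns a DataStreamWriter; nothing executes until start().
val query = shouting.writeStream
  .outputMode("append")           // "complete" would require a streaming aggregation
  .format("console")
  .start()

query.awaitTermination()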


Thanks,
Amit

From: Ayoub Benali
Date: Tuesday, August 2, 2016 at 11:59 AM
To: user
Cc: Jacek Laskowski, Amit Sela, Michael Armbrust
Subject: Re: spark 2.0 readStream from a REST API

Why is writeStream needed to consume the data?

When I tried it I got this exception:

INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
org.apache.spark.sql.AnalysisException: Complete output mode not supported when 
there are no streaming aggregations on streaming DataFrames/Datasets;
at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:173)
at 
org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:65)
at 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:236)
at 
org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:287)
at .<init>(<console>:59)



2016-08-01 18:44 GMT+02:00 Amit Sela:
I think you're missing:

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

Did it help?

On Mon, Aug 1, 2016 at 2:44 PM Jacek Laskowski wrote:
On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali wrote:

> The problem now is that when I consume the dataframe, for example with count,
> I get the stack trace below.

Mind sharing the entire pipeline?

> I followed the implementation of TextSocketSourceProvider to implement my
> data source, and the Text Socket source is used in the official documentation
> here.

Right. Completely forgot about the provider. Thanks for reminding me about it!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski





Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Sela, Amit
It seems I forgot to add the link to the “Technical Vision” paper, so here it is: 
https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing

From: "Sela, Amit" <ans...@paypal.com>
Date: Saturday, May 21, 2016 at 11:52 PM
To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>, "user @spark" <user@spark.apache.org>
Cc: Ovidiu Cristian Marcu <ovidiu21ma...@gmail.com>
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

This is a “Technical Vision” paper for the Spark runner, which provides general 
guidelines for the future development of Spark’s Beam support as part of the 
Apache Beam (incubating) project.
This is our JIRA: 
https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Generally, I’m currently working on Dataset integration for batch (to replace 
the RDD) against Spark 1.6, and will then move on to enhancing stream processing 
capabilities with Structured Streaming (2.0).

And you’re welcome to ask those questions on the Apache Beam (incubating) 
mailing list as well ;)
http://beam.incubator.apache.org/mailing_lists/

Thanks,
Amit

From: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
Date: Tuesday, May 17, 2016 at 12:11 AM
To: "user @spark" <user@spark.apache.org>
Cc: Ovidiu Cristian Marcu <ovidiu21ma...@gmail.com>
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

Could you please consider a short answer regarding the Apache Beam Capability 
Matrix to-dos for the future Spark 2.0 release [4]? (Some related references below: 
[5][6].)

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:

Hi,

We can see in [2] many interesting (and expected!) improvements (promises) like 
extended SQL support, a unified API (DataFrames, Datasets), an improved engine 
(Tungsten relates to ideas from modern compilers and MPP databases, similar to 
Flink [3]), structured streaming, etc. It seems we are somehow witnessing a smart 
unification of Big Data analytics (Spark and Flink: the best of two worlds)!

How does Spark respond to the missing What/Where/When/How questions 
(capabilities) highlighted in the unified Beam model [1]?

Best,
Ovidiu

[1] 
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] 
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/





Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Sela, Amit
This is a “Technical Vision” paper for the Spark runner, which provides general 
guidelines for the future development of Spark’s Beam support as part of the 
Apache Beam (incubating) project.
This is our JIRA: 
https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Generally, I’m currently working on Dataset integration for batch (to replace 
the RDD) against Spark 1.6, and will then move on to enhancing stream processing 
capabilities with Structured Streaming (2.0).

And you’re welcome to ask those questions on the Apache Beam (incubating) 
mailing list as well ;)
http://beam.incubator.apache.org/mailing_lists/

Thanks,
Amit

From: Ovidiu-Cristian MARCU
Date: Tuesday, May 17, 2016 at 12:11 AM
To: "user @spark"
Cc: Ovidiu Cristian Marcu
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

Could you please consider a short answer regarding the Apache Beam Capability 
Matrix to-dos for the future Spark 2.0 release [4]? (Some related references below: 
[5][6].)

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU wrote:

Hi,

We can see in [2] many interesting (and expected!) improvements (promises) like 
extended SQL support, a unified API (DataFrames, Datasets), an improved engine 
(Tungsten relates to ideas from modern compilers and MPP databases, similar to 
Flink [3]), structured streaming, etc. It seems we are somehow witnessing a smart 
unification of Big Data analytics (Spark and Flink: the best of two worlds)!

How does Spark respond to the missing What/Where/When/How questions 
(capabilities) highlighted in the unified Beam model [1]?

Best,
Ovidiu

[1] 
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] 
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/





Apache Beam Spark runner

2016-03-19 Thread Sela, Amit
Hi all,

The Apache Beam Spark runner is now available at: 
https://github.com/apache/incubator-beam/tree/master/runners/spark Check it out!
The Apache Beam (http://beam.incubator.apache.org/) project provides a unified model 
for building data pipelines based on Google’s Dataflow programming model, and it 
now supports Spark as well!

Take it for a ride on your Spark cluster!

Thanks,
Amit




Why does DStream have a different StorageLevel than RDD ?

2016-01-26 Thread Sela, Amit
I was wondering why DStream and RDD have different default StorageLevels for cache().
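
For context, a minimal sketch of the asymmetry in question, assuming the Spark 1.x defaults and an existing SparkContext sc and StreamingContext ssc:

// RDD.cache() is persist(StorageLevel.MEMORY_ONLY): deserialized objects in memory.
val rdd = sc.parallelize(1 to 100)
rdd.cache()

// DStream.cache() is persist(StorageLevel.MEMORY_ONLY_SER): serialized bytes in memory.
val lengths = ssc.socketTextStream("localhost", 9999).map(_.length)
lengths.cache()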

Thanks,
Amit


Support for custom serializers in Checkpoint

2015-12-06 Thread Sela, Amit
Why does Spark allow only Java serialization in checkpointing? I see in 
Checkpoint.serialize() that it doesn’t even try to load a serializer from the 
configuration and just uses Java’s ObjectOutputStream.

This means that I can’t use Avro (for example) in updateStateByKey, right?

Is there a reason for that? Are there plans to change that?
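
To make the question concrete, here is a minimal hypothetical updateStateByKey sketch (a StreamingContext ssc and a DStream[(String, Long)] named pairs are assumed); because checkpoints are written with ObjectOutputStream, the state type has to be Java-serializable regardless of spark.serializer:

// State type: case classes are java.io.Serializable, which is what checkpointing requires.
case class Counter(total: Long)

val updateFunc: (Seq[Long], Option[Counter]) => Option[Counter] =
  (values, state) => Some(Counter(state.map(_.total).getOrElse(0L) + values.sum))

ssc.checkpoint("hdfs:///tmp/checkpoints")   // hypothetical checkpoint directory
val counts = pairs.updateStateByKey[Counter](updateFunc)
counts.print()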

Thanks,
Amit


Accumulators internals and reliability

2015-10-26 Thread Sela, Amit
It seems like there is not much literature about Spark's Accumulators, so I 
thought I'd ask here:

Do accumulators reside in a task? Are they serialized with the task, and sent 
back on task completion as part of the ResultTask?

Are they reliable? If so, when? Can I rely on an accumulator's value only after 
the task has successfully completed (meaning in the driver), or also during 
task execution (and what about speculative execution)?

What are the limitations on the number (or size) of accumulators?
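
For what it's worth, the programming guide documents that accumulator updates performed inside actions are applied exactly once per task, while updates inside transformations may be re-applied when tasks are retried or speculated. A minimal sketch of the "read only on the driver, after the action" pattern, using the Spark 1.x API and an assumed SparkContext sc:

// Named accumulator created on the driver.
val processed = sc.accumulator(0L, "processed")

val data = sc.parallelize(1 to 1000)
data.foreach { x =>
  processed += 1L   // update inside an action: counted exactly once per successful task
}

// Safe to read only here, on the driver, after the action has completed.
println(s"processed = ${processed.value}")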

Thanks,
Amit


NullPointerException when adding to accumulator

2015-10-14 Thread Sela, Amit
I'm running a simple streaming application that reads from Kafka, maps the 
events and prints them, and I'm trying to use accumulators to count the number 
of mapped records.

While this works in standalone mode (IDE), when submitting to YARN I get a 
NullPointerException on accumulator.add(1) or accumulator += 1.

Is anyone using accumulators in .map() with Spark 1.5 and YARN?
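
For reference, a minimal sketch of the setup described above (the stream construction and the mapping are hypothetical; the accumulator is created on the driver and incremented inside the DStream map):

// Assumes an existing StreamingContext ssc and a DStream[String] kafkaStream
// obtained from the Kafka receiver or direct API.
val mappedRecords = ssc.sparkContext.accumulator(0L, "mappedRecords")

val mapped = kafkaStream.map { event =>
  mappedRecords += 1L     // incremented on the executors for every mapped record
  event.toUpperCase       // hypothetical mapping
}

mapped.print()
ssc.start()
ssc.awaitTermination()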

Thanks,
Amit





Anyone using Intel's spark-streamingsql project to execute SQL queries over Spark streaming ?

2015-07-29 Thread Sela, Amit
Is there an available release?
Is anyone using it in production?
Is the project actively developed and maintained?

Thanks!


How does Spark streaming move data around ?

2015-07-06 Thread Sela, Amit
I know that Spark uses data parallelism over, say, HDFS, optimally running 
computations on local data (a.k.a. data locality).
I was wondering how Spark Streaming moves data (messages) around, since the 
data is streamed in as DStreams and is not on a distributed FS like HDFS.

Thanks!