The currently available output modes are Complete and Append. Complete mode is
for stateful processing (aggregations), and Append mode for stateless
processing (e.g., map/filter). See:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
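To make the distinction concrete, here is a minimal sketch (plain Java, not the Spark API) of the two emission semantics: a Complete-mode sink re-emits the entire aggregated result table on every trigger, while an Append-mode sink emits only the records that arrived since the last trigger. The class and method names are illustrative, not Spark's.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a streaming word count with two output modes.
public class OutputModes {
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private final List<String> newRecords = new ArrayList<>();

    // Ingest one record: update the stateful aggregation and the append buffer.
    public void add(String word) {
        counts.merge(word, 1, Integer::sum);
        newRecords.add(word);
    }

    // Complete mode: emit the full aggregated table on every trigger.
    public Map<String, Integer> emitComplete() {
        newRecords.clear();
        return new LinkedHashMap<>(counts);
    }

    // Append mode: emit only records that arrived since the last trigger.
    public List<String> emitAppend() {
        List<String> out = new ArrayList<>(newRecords);
        newRecords.clear();
        return out;
    }
}
```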
It seems I forgot to add the link to the “Technical Vision” paper, so here it
is:
https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing
From: "Sela, Amit" <ans...@paypal.com>
Date: Saturday, Ma
This is a “Technical Vision” paper for the Spark runner, which provides general
guidelines to the future development of Spark’s Beam support as part of the
Apache Beam (incubating) project.
This is our JIRA -
Hi all,
The Apache Beam Spark runner is now available at:
https://github.com/apache/incubator-beam/tree/master/runners/spark Check it out!
The Apache Beam (http://beam.incubator.apache.org/) project is a unified model
for building data pipelines using Google’s Dataflow programming model, and
I was wondering why DStream and RDD have different default cache() StorageLevels?
Thanks,
Amit
Why does Spark allow only Java Serializable in checkpointing? I see in
Checkpoint.serialize() that it doesn't even try to load a serializer from the
configuration and uses Java's ObjectOutputStream.
This means that I can't use Avro (for example) in updateStateByKey, right?
Is there a reason
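A minimal sketch of the behavior described above: when state is written with Java's ObjectOutputStream, anything checkpointed must implement java.io.Serializable, and a configured serializer (Avro, Kryo, etc.) is never consulted. The class below is illustrative, not Spark's actual Checkpoint code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Sketch of Java-serialization-only checkpointing.
public class CheckpointSketch {
    // Fails with NotSerializableException (wrapped) for non-Serializable state.
    static byte[] serialize(Object state) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that a String round-trips fine, while a plain Object (no Serializable) fails at write time, which is exactly the constraint the question is about.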
It seems like there is not much literature about Spark's Accumulators so I
thought I'd ask here:
Do Accumulators reside in a Task? Are they serialized with the task?
Sent back on task completion as part of the ResultTask?
Are they reliable? If so, when? Can I rely on accumulators
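The lifecycle asked about above can be sketched as a simplified model (plain Java, not Spark's implementation): the driver-side accumulator is copied for each task (as if shipped with its serialized closure), each task updates its local copy, and the local values are merged back on the driver when the task result returns. All names here are illustrative.

```java
import java.util.List;

// Simplified model of accumulator shipping and merging.
public class AccumulatorSketch {
    static class LongAccumulator {
        long value;

        // The copy shipped with a task starts from zero on the worker.
        LongAccumulator copyForTask() { return new LongAccumulator(); }

        void add(long v) { value += v; }

        // Driver-side merge of a returned task-local copy.
        void merge(LongAccumulator taskCopy) { value += taskCopy.value; }
    }

    // Run a "job": one task per partition, merge task copies on the driver.
    static long countRecords(List<List<String>> partitions) {
        LongAccumulator driver = new LongAccumulator();
        for (List<String> partition : partitions) {
            LongAccumulator local = driver.copyForTask(); // shipped with the task
            for (String record : partition) local.add(1); // task-side updates
            driver.merge(local);                          // sent back on completion
        }
        return driver.value;
    }
}
```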
I'm running a simple streaming application that reads from Kafka, maps the
events and prints them and I'm trying to use accumulators to count the number
of mapped records.
While this works standalone (in the IDE), when submitting to YARN I get a
NullPointerException on accumulator.add(1) or
Is there an available release?
Anyone using in production?
Is the project being actively developed and maintained?
Thanks!
I know that Spark uses data parallelism over, say, HDFS, optimally running
computations on local data (aka data locality).
I was wondering how Spark Streaming moves data (messages) around, since the
data is streamed in as DStreams and is not on a distributed FS like HDFS.
Thanks!