Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Jon Gregg
On Nov 16, 2016, Karim, Md. Rezaul <rezaul.ka...@insight-centre.org> wrote: Hi All, I am completely new with Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using Spark Streaming…

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Mohammad Tariq
…new with Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using the Spark Streaming API with Kafka. I am aware of the Spark Streaming and Kafka integration [1]. However, a real-life example would be better to start with…

Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Karim, Md. Rezaul
Hi All, I am completely new with Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using the Spark Streaming API with Kafka. I am aware of the Spark Streaming and Kafka integration [1]. However, a real-life example would be better to start with…
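For readers landing on this thread: a minimal end-to-end sketch of the integration being asked about. It assumes the spark-streaming-kafka-0-8 artifact, a broker at localhost:9092 and an "events" topic (all illustrative, not from the thread), and counts words arriving on the topic.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaWordCount {
      def main(args: Array[String]): Unit = {
        // 5-second micro-batches, as in the integration guide's examples
        val ssc = new StreamingContext(new SparkConf().setAppName("KafkaWordCount"), Seconds(5))
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("events"))
        messages.map(_._2)              // keep the message value, drop the key
          .flatMap(_.split("\\s+"))
          .map((_, 1L))
          .reduceByKey(_ + _)
          .print()
        ssc.start()
        ssc.awaitTermination()
      }
    }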

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Manish Malhotra
On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote: Hello Spark Devs/Users, I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every batch (say 2 mins), data needs to go to the same reducer node after grouping by key…

Re: Spark Streaming: question on sticky session across batches ?

2016-11-15 Thread Takeshi Yamamuro
…wrote: Sending again. Any help is appreciated! Thanks in advance. On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote: Hello Spark Devs/Users, I'm trying to solve a use case with Spark Streaming 1…

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
On Mon, Nov 14, 2016 at 5:43 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: Hi, the time interval can be controlled by `IdleTimeBetweenReadsInMillis` in KinesisClientLibConfiguration though…

Re: Spark Streaming: question on sticky session across batches ?

2016-11-14 Thread Manish Malhotra
Sending again. Any help is appreciated! Thanks in advance. On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote: Hello Spark Devs/Users, I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every batch (say 2 mins…

Spark streaming data loss due to timeout in writing BlockAdditionEvent to WAL by the driver

2016-11-14 Thread Arijit
…at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:87)
at org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock…

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
…configurable in the current implementation. The details can be found in:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152 …

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
…configurable in the current implementation. The details can be found in:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152 …

Re: spark streaming with kinesis

2016-11-14 Thread Shushant Arora
…KinesisClientLibConfiguration though, it is not configurable in the current implementation. The details can be found in:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152
// maropu

Re: spark streaming with kinesis

2016-11-14 Thread Takeshi Yamamuro
Hi, the time interval can be controlled by `IdleTimeBetweenReadsInMillis` in KinesisClientLibConfiguration though, it is not configurable in the current implementation. The details can be found in:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisReceiver.scala#L152

receiver based spark streaming doubts

2016-11-13 Thread Shushant Arora
Hi, in receiver-based Spark Streaming, when the receiver gets data and stores it in blocks for workers to process, how many blocks does the receiver give to a worker? Say I have a streaming app with a 30-sec batch interval; what will happen? 1. For the first batch (first 30 sec) there will not be any data…
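A note for context, since the question is about block counts: in the receiver model, blocks per batch follow from the batch and block intervals, and each block becomes one partition/task. A rough sketch of that arithmetic, assuming the default block interval:

    // blocks per batch = batch interval / spark.streaming.blockInterval
    val batchIntervalMs = 30000                             // 30-sec batches, as in the question
    val blockIntervalMs = 200                               // default spark.streaming.blockInterval
    val blocksPerBatch  = batchIntervalMs / blockIntervalMs // = 150 blocks, i.e. 150 partitions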

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread shyla deshpande
…each executor. I realize that this is not the original question on this thread, and I don't mean to hijack it. I am also interested in the potential of Akka Stream sources for a Spark Streaming job directly, which could potentially be adapted for both Kafka and non-Kafka use cases…

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread shyla deshpande
…thread, and I don't mean to hijack it. I am also interested in the potential of Akka Stream sources for a Spark Streaming job directly, which could potentially be adapted for both Kafka and non-Kafka use cases, with the emphasis for me being on use cases which aren't necessarily Kafka-specific…

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
On Sat, Nov 12, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote: Are you certain you aren't getting any failed tasks or other errors? Output actions like foreach aren't exactly-once and will be retried on failures…

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread ayan guha
…Cody Koeninger <c...@koeninger.org> wrote: Are you certain you aren't getting any failed tasks or other errors? Output actions like foreach aren't exactly-once and will be retried on failures. On Nov 12…

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Sean McKibben
…to hijack it. I am also interested in the potential of Akka Stream sources for a Spark Streaming job directly, which could potentially be adapted for both Kafka and non-Kafka use cases, with the emphasis for me being on use cases which aren't necessarily Kafka-specific. There are some portions which…

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
…exactly-once and will be retried on failures. On Nov 12, 2016 06:36, "dev loper" <spark...@gmail.com> wrote: Dear fellow Spark Users, my Spark Streaming application (Spark 2.0, on an AWS EMR YARN cluster…

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Jacek Laskowski
…be interested in collaborating on the Structured Streaming version, for sure. On Thu, Nov 10, 2016 at 8:46 AM, shyla deshpande <deshpandesh...@gmail.com> wrote: I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming and Cassandra using Structured Streaming…

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
…failed tasks or other errors? Output actions like foreach aren't exactly-once and will be retried on failures. On Nov 12, 2016 06:36, "dev loper" <spark...@gmail.com> wrote: Dear fellow Spark Users, my Spark Streaming application (Spark 2.0…

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
Are you certain you aren't getting any failed tasks or other errors? Output actions like foreach aren't exactly-once and will be retried on failures. On Nov 12, 2016 06:36, "dev loper" <spark...@gmail.com> wrote: Dear fellow Spark Users, my Spark Streaming application…

spark streaming with kinesis

2016-11-12 Thread Shushant Arora
*Hi*, is *spark.streaming.blockInterval* for the Kinesis input stream hardcoded to 1 sec, or is it configurable? It is the time interval at which the receiver fetches data from Kinesis, meaning the stream batch interval cannot be less than *spark.streaming.blockInterval*, and this should be configurable. Also, is…
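For reference, the generic receiver knob named in the question is set on the Spark conf; a minimal sketch with an illustrative value (and note, per the replies in this thread, the Kinesis fetch interval itself is a KCL setting, not this property):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // how often received data is chunked into blocks before being stored
      .set("spark.streaming.blockInterval", "500ms")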

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Luciano Resende
…deshpandesh...@gmail.com> wrote: I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming and Cassandra using Structured Streaming. But the Kafka source support for Structured Streaming is not yet available. So now I am trying to use Akka Stream as the source…

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-12 Thread Jacek Laskowski
…Jacek Laskowski. On Thu, Nov 10, 2016 at 8:46 AM, shyla deshpande <deshpandesh...@gmail.com> wrote: I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming and Cassandra using Structured Streaming. But the Kafka source support for Structured Streaming…

Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
Dear fellow Spark Users, my Spark Streaming application (Spark 2.0, on an AWS EMR YARN cluster) listens to campaigns based on live stock feeds, and the batch duration is 5 seconds. The application uses the Kafka DirectStream and, based on the feed source, there are three streams. As given in the code…
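An illustrative sketch of the operation under discussion, not the author's code (the stream, value type and sink are hypothetical): within a single micro-batch, reduceByKey collapses all records sharing a key, which is why surviving duplicates usually point at retried output actions rather than the reduce itself.

    // campaignStream: hypothetical DStream[(String, Double)] of (campaignId, price)
    val onePerCampaign = campaignStream.reduceByKey((a, b) => math.max(a, b))
    onePerCampaign.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // hypothetical sink; per Cody's point below, it must tolerate retries (be idempotent)
        records.foreach(persistIdempotently)
      }
    }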

Anyone using ProtoBuf for Kafka messages with Spark Streaming for processing?

2016-11-10 Thread shyla deshpande
Using ProtoBuf for Kafka messages with Spark Streaming because ProtoBuf is already being used in the system. Some sample code and reading material for using ProtoBuf for Kafka messages with Spark Streaming would be helpful. Thanks
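A minimal sketch of one way to do this, under stated assumptions: messages on the topic are serialized protobufs of a generated class `Event` (hypothetical), `ssc` is an existing StreamingContext, and the 0.8 direct stream consumes them as raw bytes.

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val raw = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
      ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("events"))
    // Event is a protobuf-generated class; parseFrom decodes the payload bytes
    val events = raw.map { case (_, bytes) => Event.parseFrom(bytes) }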

Spark Streaming: question on sticky session across batches ?

2016-11-10 Thread Manish Malhotra
Hello Spark Devs/Users, I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every batch (say 2 mins), data needs to go to the same reducer node after grouping by key. The underlying storage is Cassandra, not HDFS. This is a map-reduce job, where I am also trying to use…
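Spark Streaming does not pin a partition to a physical node across batches, but a fixed partitioner does keep the key-to-partition mapping stable, so stateful operators process a given key in the same task. A minimal sketch of that idea, assuming a hypothetical keyedStream of (String, Long) pairs:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(24) // fixed partition count, stable key -> partition mapping
    val running = keyedStream.updateStateByKey(
      (values: Seq[Long], state: Option[Long]) => Some(state.getOrElse(0L) + values.sum),
      partitioner)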

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread Cody Koeninger
…wrote: I have already integrated common actors. I am also interested, especially to see how we can achieve end-to-end back pressure. 2016-11-10 8:46 GMT+01:00 shyla deshpande <deshpandesh...@gmail.com>: I am using Spark 2.0.1. I wanted to build a data pipeline using…

Re: Akka Stream as the source for Spark Streaming. Please advice...

2016-11-10 Thread vincent gromakowski
I have already integrated common actors. I am also interested, especially to see how we can achieve end-to-end back pressure. 2016-11-10 8:46 GMT+01:00 shyla deshpande <deshpandesh...@gmail.com>: I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming…

Akka Stream as the source for Spark Streaming. Please advice...

2016-11-09 Thread shyla deshpande
I am using Spark 2.0.1. I wanted to build a data pipeline using Kafka, Spark Streaming and Cassandra using Structured Streaming. But the Kafka source support for Structured Streaming is not yet available. So now I am trying to use Akka Stream as the source for Spark Streaming. Want to make sure I…

Re: Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-09 Thread coolgar
…").foreach { line =>
  if // it's a header
    parser = someParserBasedOn(line)
  else
    items += parser.parse(line)
}
items.iterator
}

Spark streaming delays spikes

2016-11-09 Thread Shlomi.b
Hi, we are using Spark Streaming version 1.6.2 and came across a weird behavior. Our system pulls log event data from Flume servers, enriches the events and saves them to ES. We are using a window interval of 15 seconds and the rate at peak hours is around 70K events. The average time to process…

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-08 Thread Arijit
…spark.apache.org. Subject: Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL. For the WAL in Spark to work with HDFS, the HDFS version you are running must support file appends. Contact your HDFS package/installation provider to figure out whether this is supported by your…

Spark streaming uses lesser number of executors

2016-11-08 Thread Aravindh
Hi, I am using Spark Streaming to process some events. It is deployed in standalone mode with 1 master and 3 workers. I have set the number of cores per executor to 4 and the total number of cores to 24; this means 6 executors in total will be spawned. I have set spread-out to true. So each worker machine…

Re: spark streaming with kinesis

2016-11-07 Thread Takeshi Yamamuro
// maropu. On Mon, Nov 7, 2016 at 1:36 PM, Shushant Arora <shushantaror...@gmail.com> wrote: Hi, by receiver I meant the Spark Streaming receiver architecture, meaning worker nodes are different from receiver nodes. There is no direct consumer/low-level consumer like Kafka's…

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-07 Thread Tathagata Das
…Promise@6d8f1feb)) java.lang.IllegalStateException: File exists and there is no append support!
at org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35)
at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$ap…

Re: Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-07 Thread Cody Koeninger
if // it's a header
  parser = someParserBasedOn(line)
else
  items += parser.parse(line)
}
items.iterator
}
On Mon, Nov 7, 2016 at 4:22 PM, coolgar <karllbunn...@gmail.com> wrote: I'm using Apache Spark Streaming with the Kafka direct consumer. The data stre…

Using Apache Spark Streaming - how to handle changing data format within stream

2016-11-07 Thread coolgar
I'm using Apache Spark Streaming with the Kafka direct consumer. The data stream I'm receiving is log data that includes a header with each block of messages. Each DStream can therefore have many blocks of messages, each with its own header. The header is used to know how to interpret…
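The approach suggested in the replies above, fleshed out as a minimal sketch; lineStream, isHeader, parserFor, LineParser and Record are hypothetical stand-ins for whatever the log format defines, and it assumes each partition begins with a header line.

    import scala.collection.mutable.ArrayBuffer

    val parsed = lineStream.mapPartitions { lines =>
      val items = ArrayBuffer[Record]()
      var parser: LineParser = null
      lines.foreach { line =>
        if (isHeader(line)) parser = parserFor(line) // header: switch parsers
        else items += parser.parse(line)             // data: parse under the current header
      }
      items.iterator
    }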

Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-07 Thread Arijit
…getOutputStream(HdfsUtils.scala:35)
at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream$lzycompute(FileBasedWriteAheadLogWriter.scala:33)
at org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.org$apache$spark$streaming$util$FileBasedWriteAheadLogWriter$$stream(FileBasedWriteAhe…

Re: Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Michael Segel
Spark inherits its security from the underlying mechanisms in either YARN or Mesos (whichever environment you are launching your cluster/jobs in). That said… there is limited support from Ranger. There are three parts to this… 1) Ranger being called when the job is launched… 2) Ranger being…

Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Mudit Kumar
Hi, does Ranger provide security to Spark? If yes, then in what capacity? Thanks, Mudit

Re: spark streaming with kinesis

2016-11-06 Thread Shushant Arora
Hi, by receiver I meant the Spark Streaming receiver architecture, meaning worker nodes are different from receiver nodes. Is there no direct consumer/low-level consumer like Kafka's in Kinesis Spark Streaming? Is there any limitation on the checkpoint interval (a minimum of 1 second) in Spark Streaming…

Spark Streaming and Kinesis

2016-10-27 Thread Benjamin Kim
Has anyone worked with AWS Kinesis and retrieved data from it using Spark Streaming? I am having issues where it’s returning no data. I can connect to the Kinesis stream and describe it using Spark. Is there something I’m missing? Are there specific IAM security settings needed? I just simply…

Re: Spark streaming communication with different versions of kafka

2016-10-25 Thread Cody Koeninger
…know if the same Spark Streaming job can consume from Kafka 0.8.1 and write the data to Kafka 0.9. Just trying to replicate the Kafka server. Yes, Kafka's MirrorMaker can be used to replicate, but I was curious to know if that can be achieved by Spark Streaming.

Spark streaming communication with InfluxDB

2016-10-25 Thread Gioacchino
Hi, I would like to know if there is a code example for writing data to InfluxDB from Spark Streaming in Scala/Python. Thanks in advance, Gioacchino

Spark streaming communication with different versions of kafka

2016-10-25 Thread Prabhu GS
Hi, I would like to know if the same Spark Streaming job can consume from Kafka 0.8.1 and write the data to Kafka 0.9. Just trying to replicate the Kafka server. Yes, Kafka's MirrorMaker can be used to replicate, but I was curious to know if that can be achieved by Spark Streaming. Please share…

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
…checkpoint the sequence number using some API. On Tue, Oct 25, 2016 at 7:07 AM, Takeshi Yamamuro <linguin@gmail.com> wrote: Hi, the only thing you can do for Kinesis checkpoints is tune their interval:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala#L68 …

Re: spark streaming with kinesis

2016-10-24 Thread Shushant Arora
…replicated across executors. However, if all the executors that have the replicated data crash, IIUC data loss occurs. // maropu. On Mon, Oct 24, 2016 at 4:43 PM, Shushant Arora <shushantaror...@gmail.com> wrote: Does the Spark Streaming consumer for Kinesis…

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
Hi, the only thing you can do for Kinesis checkpoints is tune their interval:
https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala#L68
Whether data loss occurs or not depends on the storage level you set…

Re: Issues with reading gz files with Spark Streaming

2016-10-24 Thread Steve Loughran
On 22 Oct 2016, at 20:58, Nkechi Achara wrote: I do not use rename, and the files are written to, and then moved to, a directory on HDFS in gz format. In that case there's nothing obvious to me. Try logging at trace/debug the class:…

spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Does the Spark Streaming consumer for Kinesis use the Kinesis Client Library and mandate checkpointing the sequence numbers of shards in DynamoDB? Will it lead to data loss if consumed data records are not yet processed, Kinesis has checkpointed the consumed sequence numbers in DynamoDB, and Spark…
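For concreteness, a minimal sketch of setting up the receiver being discussed; the checkpointInterval argument is the knob the replies point at (the KCL writes shard sequence numbers to DynamoDB at this rate). Application name, stream name, endpoint and region are illustrative, and ssc is an existing StreamingContext.

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.kinesis.KinesisUtils

    val stream = KinesisUtils.createStream(
      ssc, "myApp", "myStream",
      "kinesis.us-east-1.amazonaws.com", "us-east-1",
      InitialPositionInStream.LATEST,
      Seconds(10),                    // Kinesis (DynamoDB) checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2) // replicate blocks to reduce loss on executor failure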

Spark streaming crashes with high throughput

2016-10-23 Thread Jeyhun Karimov
Hi, I am getting the *Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.* error with a Spark Streaming job. I am using Spark 2.0.0. The job is a simple windowed aggregation and the stream is read from a socket. Average…

Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Nkechi Achara
I do not use rename, and the files are written to, and then moved to, a directory on HDFS in gz format. On 22 October 2016 at 15:14, Steve Loughran wrote: On 21 Oct 2016, at 15:53, Nkechi Achara wrote: Hi, I am using Spark…

Re: Issues with reading gz files with Spark Streaming

2016-10-22 Thread Steve Loughran
On 21 Oct 2016, at 15:53, Nkechi Achara wrote: Hi, I am using Spark 1.5.0 to read gz files with textFileStream, but nothing is read when new files are dropped in the specified directory. I know this is only the case with gz files, as when I extract the file into the…

Issues with reading gz files with Spark Streaming

2016-10-21 Thread Nkechi Achara
Hi, I am using Spark 1.5.0 to read gz files with textFileStream, but nothing is read when new files are dropped in the specified directory. I know this is only the case with gz files, as when I extract the file into the specified directory the files are read on the next window and processed. My code is here:…

spark streaming client program needs to be restarted after few hours of idle time. how can I fix it?

2016-10-18 Thread kant kodali
Hi Guys, my Spark Streaming client program works fine as long as the receiver receives the data. But say my receiver has no more data to receive for a few hours (4-5 hours) and then starts receiving data again; at that point the Spark client program doesn't seem to process any data…

Re: Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Sean Owen
…Kafka at my backend, which acts like a bus for messages. Whenever a search request is done I publish the nano time of the current system. I want to use Spark Streaming to aggregate such data, but I am so new to it. I wanted to follow that example: http://spark.apache.org…

Spark Streaming 2 Kafka 0.10 Integration for Aggregating Data

2016-10-18 Thread Furkan KAMACI
Hi, I have a search application and want to monitor queries per second for it. I have Kafka at my backend, which acts like a bus for messages. Whenever a search request is done I publish the nano time of the current system. I want to use Spark Streaming to aggregate such data, but I am so new…
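A minimal sketch of the aggregation described, assuming a hypothetical searchEvents DStream already parsed from the Kafka topic: count events over a sliding window, then divide by the window length to get queries per second.

    import org.apache.spark.streaming.Seconds

    // note: countByWindow uses an incremental window reduce, so ssc.checkpoint(dir) must be set
    val qps = searchEvents
      .countByWindow(Seconds(60), Seconds(5)) // total events in the last minute, updated every 5s
      .map(_ / 60.0)                          // average queries per second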

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
…at 10:57 AM, Samy Dindane <s...@dindane.com> wrote: Hi, is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them. Thank…

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Cody Koeninger
On Wed, Oct 12, 2016 at 8:53 AM, Samy Dindane <s...@dindane.com> wrote: That's what I was looking for, thank you…

Re: Limit Kafka batches size with Spark Streaming

2016-10-13 Thread Samy Dindane
…at 6:08 PM, Cody Koeninger wrote: http://spark.apache.org/docs/latest/configuration.html: "This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane <s...@dindane.com> wrote:…

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
…The only configuration that works is "spark.streaming.kafka.maxRatePerPartition". That's better than nothing, but it'd be useful to have backpressure enabled for automatic…

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
…upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane <s...@dindane.com> wrote: Hi, is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming…

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Cody Koeninger
…spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below)." On Tue, Oct 11, 2016 at 10:57 AM, Samy Dindane <s...@dindane.com> wrote: Hi, is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming…

Re: Limit Kafka batches size with Spark Streaming

2016-10-12 Thread Samy Dindane
…wrote: Hi, is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them…

Recommended way to run spark streaming in production in EMR

2016-10-11 Thread pandees waran
All, we have a use case in which 2 Spark Streaming jobs run in the same EMR cluster. I am thinking of allowing multiple streaming contexts and running them as 2 separate spark-submits with wait-for-app-completion set to false. With this, failure detection and monitoring seem obscure and don't seem…

Re: Limit Kafka batches size with Spark Streaming

2016-10-11 Thread Cody Koeninger
…wrote: Hi, is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them…

Limit Kafka batches size with Spark Streaming

2016-10-11 Thread Samy Dindane
Hi, is it possible to limit the size of the batches returned by the Kafka consumer for Spark Streaming? I am asking because the first batch I get has hundreds of millions of records and it takes ages to process and checkpoint them. Thank you. Samy
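The two settings the replies converge on, as a minimal conf sketch (the rate value is an example, not from the thread): the per-partition cap bounds the first batch after a large offset lag, and backpressure tunes the rate automatically afterwards.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")       // adapt ingest rate to processing speed
      .set("spark.streaming.kafka.maxRatePerPartition", "10000") // hard cap, records/sec per partition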

Re: Spark Streaming Advice

2016-10-10 Thread Jörn Franke
…drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation. As my program ingested information into HDFS (as Parquet files), I noticed that the time to process each batch was…

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
…On 10 October 2016 at 15:25, Kevin Mellott <kevin.r.mell...@gmail.com> wrote: Whilst working…

Re: Spark Streaming Advice

2016-10-10 Thread Mich Talebzadeh
…application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation. As my program ingested information into HDFS (as Parquet files), I noticed…

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
Whilst working on this application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation. As my program ingested information into HDFS (as Parquet files), I…

Partitioned windows in spark streaming

2016-10-02 Thread Adrienne Kole
Hi, does Spark 2.0.0 support partitioned windows in streaming? Cheers, Adrienne

Grouped windows in spark streaming

2016-09-30 Thread Adrienne Kole
Hi all, I am using Spark Streaming for my use case. I want to:
- partition or group the stream by key
- window the tuples in partitions, and
- find the max/min element in windows (in every partition)
My code is like: val keyedStream = socketDataSource.map(s => (s.key, s.va…
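A minimal sketch of the windowed per-key max, assuming the keyedStream of (key, value) pairs from the question with numeric values (the value type and intervals are assumptions): reduceByKeyAndWindow already groups by key, so no separate partitioning step is needed for this part.

    import org.apache.spark.streaming.Seconds

    val maxPerKey = keyedStream.reduceByKeyAndWindow(
      (a: Double, b: Double) => math.max(a, b), // per-key max within the window
      Seconds(30),                              // window length
      Seconds(10))                              // slide interval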

spark streaming minimum batch interval

2016-09-29 Thread Shushant Arora
Hi, I want to enquire: does Spark Streaming have a limitation of 500ms for the batch interval? Is Storm better than Spark Streaming for real time (for a latency of just 50-100ms)? In Spark Streaming, can parallel batches be run? If yes, is it supported at production level? Thanks

Re: Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Cody Koeninger
Either artifact should work with 0.10 brokers. The 0.10 integration has more features but is still marked experimental. On Sep 26, 2016 3:41 AM, "Haopu Wang" <hw...@qilinsoft.com> wrote: Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is compatible with Kafka 0.8.2.1."…

Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Haopu Wang
Hi, in the official integration guide, it says "Spark Streaming 2.0.0 is compatible with Kafka 0.8.2.1." However, in the Maven repository I can get "spark-streaming-kafka-0-10_2.11", which depends on Kafka 0.10.0.0. Is this artifact stable enough? Thank you!
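For reference, the two artifacts in question as sbt dependencies (versions are examples for a Spark 2.0.0 build; pick one, they should not be mixed):

    // 0.8 API: stable, works with 0.8.2.1+ brokers (including 0.10, per the reply above)
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
    // 0.10 API: more features, but still marked experimental
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"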

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Jörn Franke
…data coming in to my Spark Streaming cluster from a message queue (not Kafka). And this message queue guarantees at-least-once delivery only, so there is potential that some of the messages that come in to the Spark Streaming cluster are actually duplicates, and I am trying to figure out…

Re: ideas on de duplication for spark streaming?

2016-09-24 Thread Cody Koeninger
…1:49 AM, "kant kodali" <kanth...@gmail.com> wrote: Hi Guys, I have a bunch of data coming in to my Spark Streaming cluster from a message queue (not Kafka). And this message queue guarantees at-least-once delivery only, so there is potential that…

ideas on de duplication for spark streaming?

2016-09-24 Thread kant kodali
Hi Guys, I have a bunch of data coming in to my Spark Streaming cluster from a message queue (not Kafka). And this message queue guarantees at-least-once delivery only, so there is potential that some of the messages that come in to the Spark Streaming cluster are actually duplicates, and I am trying…
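One common approach, sketched under stated assumptions (not from the thread): if each message carries a unique id, keyed state can drop repeats. The idToMessage stream and the 30-minute timeout are hypothetical; the timeout is what bounds the state's memory use.

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // idToMessage: hypothetical DStream[(String, String)] of (messageId, payload)
    val dedupSpec = StateSpec.function(
      (id: String, msg: Option[String], state: State[Boolean]) => {
        if (state.exists) None           // id seen before: drop as duplicate
        else { state.update(true); msg } // first sighting: pass the payload through
      }).timeout(Minutes(30))            // forget ids after 30 minutes

    val deduped = idToMessage.mapWithState(dedupSpec).flatMap(o => o)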

Re: The coming data on Spark Streaming

2016-09-21 Thread pcandido
Anybody?

spark streaming slow checkpointing when calling Rserve

2016-09-19 Thread Piubelli, Manuel
Hello, I wrote a Spark Streaming application in Java. It reads stock trades off of a data feed receiver and converts them to Tick objects, and uses a micro-batch interval, window interval and sliding interval of 10 seconds. A JavaPairDStream<String, Iterable> is created where t…

Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Chawla,Sumit
How are you producing data? I just tested your code and I can receive the messages from Kafka. Regards, Sumit Chawla. On Sun, Sep 18, 2016 at 7:56 PM, Sateesh Karuturi <sateesh.karutu...@gmail.com> wrote: I am very new to *Spark Streaming* and I am implementing a small exercise…

Re: Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread ayan guha
…wrote: I am very new to *Spark Streaming* and I am implementing a small exercise: sending *XML* data from *Kafka* and receiving that *streaming* data through *Spark Streaming*. I tried all possible ways, but every time I am getting *empty values*.

Getting empty values while receiving from kafka Spark streaming

2016-09-18 Thread Sateesh Karuturi
I am very new to *Spark Streaming* and I am implementing a small exercise: sending *XML* data from *Kafka* and receiving that *streaming* data through *Spark Streaming*. I tried all possible ways, but every time I am getting *empty values*. *There is no problem on the Kafka side; only…

Re: spark streaming kafka connector questions

2016-09-16 Thread 毅程
…for time 1473491465000 ms (execution: 0.066 s) *(EVENT: 2nd-time processing cost 0.066)*, and the 2nd-time processing of the event finished without really doing the work. Help is hugely appreciated.

Re: Spark Streaming-- for each new file in HDFS

2016-09-16 Thread Steve Loughran
…recommend that the third-party application put an empty file with the same filename as the original file, but with the extension ".uploaded". This is an indicator that the file has been fully (!) written to the fs. Otherwise you risk only reading parts of the file. Then, you can have a…

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Peyman Mohajerian
…system listener for this .uploaded file. Spark Streaming or Kafka are not needed/suitable if the server is a file server. You can use Oozie (maybe with a simple custom action) to poll for .uploaded files and transmit them. On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (E…

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Jörn Franke
…have a file system listener for this .uploaded file. Spark Streaming or Kafka are not needed/suitable if the server is a file server. You can use Oozie (maybe with a simple custom action) to poll for .uploaded files and transmit them. On 15 Sep 2016, at 19:00, Kappaganthu, Sivaram (ES)…

Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Kappaganthu, Sivaram (ES)
1) Is it possible for Spark Streaming to trigger a job after a file is placed, instead of triggering a job at a fixed batch interval? 2) If it is not possible with Spark Streaming, can we control this with Kafka/Flume? Thanks, Sivaram

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Cody Koeninger
…methods in DirectKafkaInputDStream and KafkaRDD respectively, if you want to see an example. On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <debie.d...@gmail.com> wrote: Ah, that makes it much clearer, thanks! It also…

Re: Spark Streaming - dividing DStream into mini batches

2016-09-15 Thread Daan Debie
…It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro-batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream, etc.) decides…

Re: Reading the most recent text files created by Spark streaming

2016-09-15 Thread Mich Talebzadeh
…some reliability mechanisms, has been explicitly designed for import/export, and is tested. Not sure if I would go for Spark Streaming if the use case is only storing, but I do not have the full picture of your use case. Anyway, what you could do is create a directory per hour…

The coming data on Spark Streaming

2016-09-14 Thread pcandido
Hi everyone, I'm starting out in Spark Streaming and would like to know some things about how data arrives. I know that SS uses micro-batches, which are received by workers and stored as RDDs. The master, at defined intervals, receives a pointer to the micro-batch RDD and can use it to process data using…

Re: Reading the most recent text files created by Spark streaming

2016-09-14 Thread Jörn Franke
Hi, an alternative to Spark could be Flume, to store data from Kafka to HDFS. It also provides some reliability mechanisms, has been explicitly designed for import/export, and is tested. Not sure if I would go for Spark Streaming if the use case is only storing, but I do not have the full…

Reading the most recent text files created by Spark streaming

2016-09-14 Thread Mich Talebzadeh
Hi, I have a Spark Streaming job that reads messages/prices from Kafka and writes them as text files to HDFS. This is pretty efficient. Its only function is to persist the incoming messages to HDFS. This is what it does: dstream.foreachRDD { pricesRDD => val x = pricesRDD.co…
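Since the snippet above is cut off, a minimal sketch of the same persist-to-HDFS pattern (the output path is hypothetical, not the author's): skip empty batches and write each batch to a timestamped directory.

    dstream.foreachRDD { (pricesRDD, time) =>
      if (!pricesRDD.isEmpty()) {
        // one directory per batch, e.g. hdfs:///prices/batch-1473491465000
        pricesRDD.saveAsTextFile(s"hdfs:///prices/batch-${time.milliseconds}")
      }
    }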

Re: Spark Streaming - dividing DStream into mini batches

2016-09-14 Thread Daan Debie
…that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro-batch/RDD into more than 1 partition based on size? Or is it something that the "source"…

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Daan Debie
Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro-batch/RDD into more than 1 partition based on size? Or is it something that the "source" (SocketStream, KafkaStream, etc.…

Re: Spark Streaming - dividing DStream into mini batches

2016-09-13 Thread Cody Koeninger
…an example. On Tue, Sep 13, 2016 at 9:37 AM, Daan Debie <debie.d...@gmail.com> wrote: Ah, that makes it much clearer, thanks! It also brings up an additional question: who/what decides on the partitioning? Does Spark Streaming decide to divide a micro-batch/RDD into more than 1 partition…
