Yarn log aggregation of spark streaming job

2018-09-24 Thread ayushChauhan
By default, YARN aggregates logs after an application completes. But I am trying to aggregate logs for a Spark Streaming job, which in theory will run forever. I have set the following properties for log aggregation and restarted YARN by restarting hadoop-yarn-nodemanager on core &…
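
For reference, the knobs involved in rolling aggregation for long-running apps are roughly these (a sketch based on the Hadoop and Spark-on-YARN docs; values are illustrative, and yarn-site.xml is of course XML rather than the key=value shorthand shown):

    # yarn-site.xml (NodeManager): aggregate logs periodically while the app runs;
    # YARN enforces a minimum of 3600 seconds for this interval.
    yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds = 3600

    # spark-defaults.conf: only log files matching this pattern are rolled up
    spark.yarn.rolledLog.includePattern = stdout|stderr.*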

Spark Streaming RDD Cleanup too slow

2018-09-05 Thread Prashant Sharma
I have a Spark Streaming job which takes too long to delete temp RDDs. I collect about 4MM telemetry metrics per minute and do minor aggregations in the streaming job. I am using Amazon R4 instances. The driver RPC call, although async I believe, is slow getting the handle for the future object…

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-30 Thread Cody Koeninger
…On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > I couldn't see any error or problem among the machines, anybody has the…

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
…java:1091) ~[kafka-clients-1.0.0.jar:na] at org.apache.spark.streaming.kafka010.DirectKafkaInputDStream.paranoidPoll(DirectKafkaInputDStream.scala:169) ~[spark-streaming-kafka-0-10_2.11-2.0.2.jar:2.0.2] at org.apache.spa…

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Cody Koeninger
Are you able to try a recent version of spark? On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández wrote: > I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this > exception and Spark dies. > > I couldn't see any error or problem among the machines,

Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I'm using Spark Streaming 2.0.1 with Kafka 0.10; sometimes I get this exception and Spark dies. I couldn't see any error or problem among the machines. Does anybody know the reason for this error? java.lang.IllegalStateException: This consumer has already been closed…

Re: [Spark Streaming] [ML]: Exception handling for the transform method of Spark ML pipeline model

2018-08-17 Thread sudododo
Hi, Any help on this? Thanks,

[Spark Streaming] [ML]: Exception handling for the transform method of Spark ML pipeline model

2018-08-16 Thread sudododo
Hi, I'm implementing a Spark Streaming + ML application. The data comes into a Kafka topic in JSON format. The Spark Kafka connector reads the data from the Kafka topic as a DStream. After several preprocessing steps, the input DStream is transformed into a feature DStream, which is fed into Spark…

Re: How to convert Spark Streaming to Static Dataframe on the fly and pass it to a ML Model as batch

2018-08-14 Thread Gourav Sengupta
Hi, or you could just use structured streaming: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html Regards, Gourav Sengupta On Tue, Aug 14, 2018 at 10:51 AM, Gerard Maas wrote: > Hi Aakash, > In Spark Streaming, forEachRDD provides you access to…

Sending data from ZeroMQ to Spark Streaming API with Python

2018-08-14 Thread oreogundipe
Hi! I'm working on a project and I'm trying to find out if I can pass data from my ZeroMQ straight into Python's streaming API. I saw some links but I didn't see anything concrete as to how to use it with Python. Can anybody please point me in the right direction?

Re: How to convert Spark Streaming to Static Dataframe on the fly and pass it to a ML Model as batch

2018-08-14 Thread Gerard Maas
Hi Aakash, In Spark Streaming, foreachRDD provides you access to the data in each micro-batch. You can transform that RDD into a DataFrame and implement the flow you describe, e.g.: var historyRDD: RDD[mytype] = sparkContext.emptyRDD // create Kafka DStream ... dstream.foreachRDD { rdd =>…
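
A minimal sketch of the pattern Gerard describes, filling in the truncated snippet above (MyType, the SparkSession spark, and the Kafka DStream creation are placeholders, not the poster's actual code):

    import org.apache.spark.rdd.RDD

    case class MyType(id: String, value: Double)

    var historyRDD: RDD[MyType] = spark.sparkContext.emptyRDD[MyType]

    // create Kafka DStream ...
    dstream.foreachRDD { rdd =>
      import spark.implicits._
      val batchDF = rdd.toDF()            // the micro-batch as a static DataFrame
      // ... feed batchDF into the batch ML pipeline here ...
      historyRDD = historyRDD.union(rdd)  // optionally accumulate history across batches
    }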

How to convert Spark Streaming to Static Dataframe on the fly and pass it to a ML Model as batch

2018-08-14 Thread Aakash Basu
Hi all, The requirement is to process a file using Spark Streaming fed from a Kafka topic and, once all the transformations are done, convert it into a batch of static dataframe and pass it into Spark ML model tuning. As of now, I had been doing it in the below fashion - 1) Read the file using Kafka 2…

How does mapPartitions function work in Spark streaming on DStreams?

2018-08-09 Thread zakhavan
Hello, I am running a Spark Streaming program on a dataset which is a sequence of numbers in a text file format. I read the text file and load it into a Kafka topic, then run the Spark Streaming program on the DStream and finally write the result into an output text file. But I'm getting…

Re: How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
FYI The relevant StackOverflow query on the same - https://stackoverflow.com/questions/51610482/how-to-do-pca-with-spark-streaming-dataframe On Tue, Jul 31, 2018 at 3:18 PM, Aakash Basu wrote: > Hi, > > Just curious to know, how can we run a Principal Component Analysis on > st

How to do PCA with Spark Streaming Dataframe?

2018-07-31 Thread Aakash Basu
Hi, Just curious to know: how can we run a Principal Component Analysis on streaming data in distributed mode? If we can, is it mathematically valid enough? Has anyone done that before? Can you guys share your experience with it? Is there any API Spark provides to do the same on Spark Streaming…

Using Spark Streaming for analyzing changing data

2018-07-30 Thread oripwk
…type of data in Spark Streaming?

Re: Question of spark streaming

2018-07-27 Thread Arun Mahadevan
…Thanks, Arun From: utkarsh rathor Date: Friday, July 27, 2018 at 5:15 AM To: "user@spark.apache.org" Subject: Question of spark streaming I am following the book Spark the Definitive Guide. The following code is executed locally using spark-shell. Procedure: Started the s…

Question of spark streaming

2018-07-27 Thread utkarsh rathor
I am following the book *Spark the Definitive Guide*. The following code is *executed locally using spark-shell*. Procedure: started the spark-shell without any other options. val static = spark.read.json("/part-00079-tid-730451297822678341-1dda7027-2071-4d73-a0e2-7fb6a91e1d1f-0-c000.json") val…

How does a Spark Streaming program call a Python file

2018-07-25 Thread 康逸之
I am trying to build a real-time system with Spark (written in Scala), but some algorithm files are written in Python. How can I call the algorithm files? Any idea how to make this work?

How does a Spark Streaming program (written in Scala) call a Python file

2018-07-25 Thread 康逸之
I am trying to build a real-time system with Spark (written in Scala), but some algorithm files are written in Python. How can I call the algorithm files? Any idea how to make this work?

Spark streaming connecting to two kafka clusters

2018-07-17 Thread Sathi Chowdhury
Hi, My question is about the ability to integrate Spark Streaming with multiple clusters. Is it a supported use case? An example: two topics owned by different groups, each with their own Kafka infra. Can I have two dataframes as a result of spark.readStream listening to different…

Re: Stopping a Spark Streaming Context gracefully

2018-07-15 Thread Dhaval Modi
+1 Regards, Dhaval Modi dhavalmod...@gmail.com On 8 November 2017 at 00:06, Bryan Jeffrey wrote: > Hello. > I am running Spark 2.1, Scala 2.11. We're running several Spark streaming > jobs. In some cases we restart these jobs on an occasional basis. We have > code…

Run STA/LTA python function using spark streaming: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute

2018-07-10 Thread zakhavan
Hello, I'm trying to run the STA/LTA Python code, which I got from the ObsPy website, using Spark Streaming, and plot the events, but I keep getting the following error: "java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute" Here i…

Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-05 Thread Thomas Lavocat
Excerpts from Prem Sure's message of 2018-07-04 19:39:29 +0530: > Hoping the below helps clear some of this up: executors don't have control > to share data among themselves, except for sharing accumulators via the > driver's support. It's all based on data locality or remote nature; tasks/stages are…

Re: [Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread Prem Sure
Hoping the below helps clear some of this up: executors don't have control to share data among themselves, except for sharing accumulators via the driver's support. It's all based on data locality or remote nature; tasks/stages are defined to perform work, which may result in a shuffle. On Wed, Jul 4, 2018 at…

[Spark Streaming MEMORY_ONLY] Understanding Dataflow

2018-07-04 Thread thomas lavocat
Hello, I have a question on Spark dataflow. If I understand correctly, all received data is sent from the executor to the driver of the application prior to task creation. Then the tasks embedding the data transit from the driver to the executors in order to be processed. As executors cannot…

union of multiple twitter streams [spark-streaming-twitter_2.11]

2018-07-02 Thread Imran Rajjad
Hello, Has anybody tried to union two streams of Twitter statuses? I am instantiating two Twitter streams through two different sets of credentials and passing them through a union function, but the console does not show any activity, nor are there any errors. --static function that returns…

Spark Streaming PID rate controller minRate default value

2018-06-29 Thread faxianzhao
Hi there, I think "spark.streaming.backpressure.pid.minRate" should default to "not set", like "spark.streaming.backpressure.initialRate". The default value of 100 is not good for my workload. It would be better to explain it in more detail in the documentation and let the user decide for himself, like…
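
For context, the settings being discussed look like this when wired up (a sketch; the rate values are illustrative and workload-dependent):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")
      // initial max receiving rate before the PID controller has any feedback
      .set("spark.streaming.backpressure.initialRate", "1000")
      // floor the PID controller can throttle down to; defaults to 100
      .set("spark.streaming.backpressure.pid.minRate", "10")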

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
Hello there, I just upgraded to Spark 2.3.1 from Spark 2.2.1, ran my streaming workload, and got an error (java.lang.AbstractMethodError) never seen before; check the error stack attached in (a) below. Does anyone know if Spark 2.3.1 works well with Kafka spark-streaming-kafka-0-10? This link…

Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-28 Thread Farshid Zavareh
…this? On Wed, Jun 27, 2018 at 12:26 AM Steve Loughran wrote: > On 25 Jun 2018, at 23:59, Farshid Zavareh wrote: > I'm writing a Spark Streaming application where the input data is put into > an S3 bucket in small batches (using Database Migration Service - DMS). The > Spark appli…

Re: [Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-26 Thread Steve Loughran
On 25 Jun 2018, at 23:59, Farshid Zavareh wrote: I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm consideri…

Re: [Spark Streaming] Measure latency

2018-06-26 Thread Gerard Maas
…want to check the performance there first. If you want to add the stream ingestion time to this, it gets a bit more tricky. Kind regards, Gerard. On Tue, Jun 26, 2018 at 11:49 AM Daniele Foroni wrote: > Hi all, > I am using spark streaming and I need to evaluate the latency of the > standa…

[Spark Streaming] Measure latency

2018-06-26 Thread Daniele Foroni
Hi all, I am using spark streaming and I need to evaluate the latency of the standard aggregations (avg, min, max, …) provided by the spark APIs. Any way to do it in the code? Thanks in advance, --- Daniele

[Spark Streaming] Spark Streaming with S3 vs Kinesis

2018-06-25 Thread Farshid Zavareh
I'm writing a Spark Streaming application where the input data is put into an S3 bucket in small batches (using Database Migration Service - DMS). The Spark application is the only consumer. I'm considering two possible architectures: Have Spark Streaming watch an S3 prefix and pick up new
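
The first architecture corresponds to Spark Streaming's file-based input DStream; a minimal sketch (bucket, prefix, and batch interval are illustrative, and S3A credentials are assumed to be configured):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(60))
    // picks up files that appear under the prefix after the stream starts
    val lines = ssc.textFileStream("s3a://my-bucket/dms-output/")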

[Spark Streaming] Are SparkListener/StreamingListener callbacks called concurrently?

2018-06-20 Thread Majid Azimi
Hi, What is the concurrency model behind SparkListener or StreamingListener callbacks? 1. Multiple threads might access callbacks simultaneously. 2. Callbacks are guaranteed to be executed by a single thread.(Thread ids might change on consecutive calls, though) I asked the same question on

[Spark Streaming]: How do I apply window before filter?

2018-06-11 Thread Tejas Manohar
Hey friends, We're trying to make some batched computations run against an OLAP DB closer to "realtime". One of our more complex computations is a trigger when event A occurs but not event B within a given time period. Our experience with Spark is limited, but since Spark 2.3.0 just introduced

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-11 Thread thomas lavocat
…configured by the user, unless you're familiar with Spark Streaming internals and know the implications of this configuration. How can I find some documentation about those implications? I've experimented with some configurations of this parameter and found out that…

[Spark Streaming] Distinct Count on unrelated columns

2018-06-06 Thread Aakash Basu
Hi guys, Posted a question (link) on StackOverflow, any help? Thanks, Aakash.

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
…thomas lavocat wrote on Tue, 5 Jun 2018 at 7:17 PM: > Hello, > Thanks for your answer. > On 05/06/2018 11:24, Saisai Shao wrote: > spark.streaming.concurrentJobs is a driver-side internal configuration; > it controls how many streaming jobs ca…

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
…Saisai Shao wrote: spark.streaming.concurrentJobs is a driver-side internal configuration; it controls how many streaming jobs can be submitted concurrently in one batch. Usually this should not be configured by the user, unless you're familiar with Spark Streaming internals,…

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
…On 05/06/2018 11:24, Saisai Shao wrote: > spark.streaming.concurrentJobs is a driver-side internal configuration; > it controls how many streaming jobs can be submitted concurrently in > one batch. Usually this should not be configured by the user, unless you're > familiar…

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
…familiar with Spark Streaming internals and know the implications of this configuration. How can I find some documentation about those implications? I've experimented with some configurations of this parameter and found out that my overall throughput increases in correlation with this property…

Re: [Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread Saisai Shao
spark.streaming.concurrentJobs is a driver-side internal configuration; it controls how many streaming jobs can be submitted concurrently in one batch. Usually this should not be configured by the user, unless you're familiar with Spark Streaming internals and know the implications…
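
In other words, it is one application-wide knob set on the driver, not a per-node value; a sketch of where it lives:

    // set before the StreamingContext is created; applies to the whole application
    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.concurrentJobs", "2") // default is 1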

[Spark Streaming] is spark.streaming.concurrentJobs a per node or a cluster global value ?

2018-06-05 Thread thomas lavocat
Hi everyone, I'm wondering if the property spark.streaming.concurrentJobs should reflect the total number of possible concurrent tasks on the cluster, or a local number of concurrent tasks on one compute node. Thanks for your help. Thomas

How to work around NoOffsetForPartitionException when using Spark Streaming

2018-06-01 Thread Martin Peng
Hi, We see the below exception when using Spark Kafka streaming 0.10 on a normal Kafka topic. Not sure why the offset is missing in ZK, but since Spark Streaming overrides the offset reset policy to none in the code, I cannot set the reset policy to latest (I don't really care about data loss now). Is there any…

Spark streaming with kafka input stuck in (Re-)joining group because of group rebalancing

2018-05-15 Thread JF Chen
When I terminate a Spark Streaming application and restart it, it always gets stuck at this step: > Revoking previously assigned partitions [] for group [mygroup] > (Re-)joining group [mygroup] If I use a new group id, it works fine, but I may lose the data from the last ti…

Making spark streaming application single threaded

2018-05-09 Thread ravidspark
Hi All, Is there any property which makes my Spark Streaming application single-threaded? I researched the property *spark.dynamicAllocation.maxExecutors=1*, but as far as I understand this launches a maximum of one container, not a single thread. In local mode, we can configure…
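
For local experiments, the closest approximation is a one-core local master (a sketch; note that receiver-based DStreams need at least local[2], one core for the receiver and one for processing):

    val conf = new org.apache.spark.SparkConf()
      .setMaster("local[1]")   // a single task thread in total
      .setAppName("single-threaded-test")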

Re: [Spark Streaming]: Does DStream workload run over Spark SQL engine?

2018-05-02 Thread Saisai Shao
…execution engine of Spark Streaming > (DStream API): do Spark Streaming jobs run over the Spark SQL engine? > For example, if I change a configuration parameter related to Spark SQL > (like spark.sql.streaming.minBatchesToRetain or > spark.sql.objectHashAggregate.sortBased.fallbackTh…

[Spark Streaming]: Does DStream workload run over Spark SQL engine?

2018-05-02 Thread Khaled Zaouk
Hi, I have a question regarding the execution engine of Spark Streaming (DStream API): do Spark Streaming jobs run over the Spark SQL engine? For example, if I change a configuration parameter related to Spark SQL (like spark.sql.streaming.minBatchesToRetain…

re: spark streaming / AnalysisException on collect()

2018-04-30 Thread Peter Liu
Hello there, I have a quick question regarding how to share data (a small data collection) between a Kafka producer and consumer using Spark Streaming (Spark 2.2): (A) the data published by a Kafka producer is received in order on the Kafka consumer side (see (a) copied below). (B) however…

Spark Streaming for more file types

2018-04-27 Thread रविशंकर नायर
…e: (String, org.apache.spark.input.PortableDataStream)): Unit = { // Code to interact with IBM Datamap OCR which converts the PDF files into text } I do want to change the above code to Spark Streaming. Unfortunately there is (this would definitely be a great addition to Spark) no "…

Re: schema change for structured spark streaming using jsonl files

2018-04-25 Thread Michael Segel
…of the JSON doc and then storing highlighted attributes in separate columns. HTH -Mike > On Apr 23, 2018, at 1:46 PM, Lian Jiang <jiangok2...@gmail.com> wrote: > Hi, > I am using structured spark streaming which reads jsonl files and writes into > parquet file…

Re: schema change for structured spark streaming using jsonl files

2018-04-24 Thread Lian Jiang
Thanks for any help! On Mon, Apr 23, 2018 at 11:46 AM, Lian Jiang <jiangok2...@gmail.com> wrote: > Hi, > I am using structured spark streaming which reads jsonl files and writes > into parquet files. I am wondering what's the process if the jsonl files' schema > changes.…

schema change for structured spark streaming using jsonl files

2018-04-23 Thread Lian Jiang
Hi, I am using structured Spark streaming, which reads jsonl files and writes into parquet files. I am wondering what the process is if the jsonl files' schema changes. Suppose jsonl files are generated in the \jsonl folder and the old schema is { "field1": String }. My proposal is: 1. write the j…

Re: How to bulk insert using spark streaming job

2018-04-19 Thread scorpio
You need to insert per partition per batch. Normally database drivers meant for Spark have a bulk update feature built in: they take an RDD and do a bulk insert per partition. In case the db driver you are using doesn't provide this feature, you can aggregate records per partition and then send them out to the db…
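
A sketch of the per-partition batching described above, using plain JDBC (the connection string, table, and pair-shaped record type are illustrative):

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection and one batched round trip per partition
        val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
        val stmt = conn.prepareStatement("INSERT INTO metrics (k, v) VALUES (?, ?)")
        records.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setLong(2, v)
          stmt.addBatch()
        }
        stmt.executeBatch()
        conn.close()
      }
    }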

Re: How to bulk insert using spark streaming job

2018-04-19 Thread ayan guha
…iem...@gmail.com> wrote: > How to bulk insert using spark streaming job > Sent from my iPhone -- Best Regards, Ayan Guha

How to bulk insert using spark streaming job

2018-04-19 Thread amit kumar singh
How to bulk insert using spark streaming job Sent from my iPhone

Re: In spark streaming application how to distinguish between normal and abnormal termination of application?

2018-04-15 Thread Igor Makhlin
Looks like nobody knows the answer to this question ;) On Sat, Mar 31, 2018 at 1:59 PM, Igor Makhlin <igor.makh...@gmail.com> wrote: > Hi All, > I'm looking for a way to distinguish between normal and abnormal > termination of a spark streaming application with (che…

Re: Testing spark streaming action

2018-04-10 Thread Jörn Franke
Run it as part of integration testing; you can still use ScalaTest but with a different sub-folder (it or integrationtest) instead of test. Within integrationtest you create a local Spark server that also has accumulators. > On 10. Apr 2018, at 17:35, Guillermo Ortiz…

Testing spark streaming action

2018-04-10 Thread Guillermo Ortiz
I have a unit test in Spark Streaming which has an input parameter: DStream[String]. Inside the code I want to update a LongAccumulator. When I execute the test I get a NullPointerException because the accumulator doesn't exist. Is there any way to test this? My accumulator is updated in…
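
One way to make the accumulator exist at test time is to drive the DStream from an in-memory queue inside a local context; a sketch (not the poster's actual test):

    import scala.collection.mutable
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc  = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("acc-test"))
    val ssc = new StreamingContext(sc, Seconds(1))
    val acc = sc.longAccumulator("processed")

    val queue  = mutable.Queue(sc.parallelize(Seq("a", "b", "c")))
    val stream = ssc.queueStream(queue)
    stream.foreachRDD(rdd => rdd.foreach(_ => acc.add(1)))

    ssc.start()
    // ... wait for a batch to complete, then assert on acc.value ...
    ssc.stop(stopSparkContext = true, stopGracefully = false)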

In spark streaming application how to distinguish between normal and abnormal termination of application?

2018-03-31 Thread Igor Makhlin
Hi All, I'm looking for a way to distinguish between normal and abnormal termination of a spark streaming application with checkpointing enabled. Adding an application listener doesn't really help because the onApplicationEnd event has no information regarding the cause of the termination. ssc_…

Transaction Examplefor spark streaming in Spark2.2

2018-03-22 Thread KhajaAsmath Mohammed
Hi Cody, I am following this to implement exactly-once semantics and also to store the offsets in a database. The question I have is how to use Hive instead of traditional datastores: a write to Hive will be successful even if there is an issue with saving offsets into the DB. Could you please…

Wait for 30 seconds before terminating Spark Streaming

2018-03-21 Thread Aakash Basu
Hi, Using: *Spark 2.3 + Kafka 0.10* How to wait for 30 seconds after the latest stream and, if there's no more streaming data, gracefully exit? Is it done with - query.awaitTermination(30) Or is it something else? I tried with this, keeping - option("startingOffsets", "latest") for both my…
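
Note that StreamingQuery.awaitTermination takes milliseconds and returns false on timeout, so "stop after 30 idle seconds" needs a small loop around it; a sketch of one possible shape (not a guaranteed-graceful shutdown):

    // query is a started org.apache.spark.sql.streaming.StreamingQuery
    var stopped = false
    while (!stopped && !query.awaitTermination(30000)) { // 30 seconds, not 30 ms
      if (!query.status.isDataAvailable && !query.status.isTriggerActive) {
        query.stop()
        stopped = true
      }
    }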

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-21 Thread Aakash Basu
…> and compact data format if CSV isn't required. > From: Aakash Basu <aakash.spark@gmail.com> > Sent: Friday, March 16, 2018 9:12:39 AM > To: sagar grover > Cc: Bowden, Chris; Tathagata Das; Dylan Guedes; Georg Heiler; user; jagrati.go...@myntra.com…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
…Cool! Shall try it and revert back tomm. Thanks a ton! On 15-Mar-2018 11:50 PM, "Bowden, Chris" <chris.bow...@microfocus.com> wrote:…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
…and deserialization is often an orthogonal and implicit transform. However, in Spark, serialization and deserialization is an explicit transform (e.g., you define it in your query plan). To make this mo…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
…offers from_csv out of the box as an expression (although CSV is well supported as a data source). You could implement an expression by reusing a lot of the supporting CSV classes, which may result in a better user experience vs. explicitly using split…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread sagar grover
…may result in a better user experience vs. explicitly using split and array indices, etc. In this simple example, casting the binary to a string just works because there is a common understanding of strings encoded as bytes between Spark and Kafka by default.…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
…From: Aakash Basu <aakash.spark@gmail.com> Sent: Thursday, March 15, 2018 10:48:45 AM To: Bowden, Chris Cc: Tathagata Das; Dylan Guedes; Georg Heiler; user Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query Hey Chris,…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
…Sent: Thursday, March 15, 2018 7:52:28 AM To: Tathagata Das Cc: Dylan Guedes; Georg Heiler; user Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query Hi, And if I run this below piece of code - from pyspark.sql import SparkSession import time class test: spark…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Tathagata Das
…From: Aakash Basu <aakash.spark@gmail.com> > Sent: Thursday, March 15, 2018 7:52:28 AM > To: Tathagata Das > Cc: Dylan Guedes; Georg Heiler; user > Subject: Re: Multiple Kafka Spark Streaming Dataframe Join query > Hi, > And if I run this below piece of code -…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
…https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html This is a true stream-stream join which will automatically buffer delayed data and appropriately join stuff with SQL join semantics. Please check it out :)…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
…On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu <aakash.spark....@gmail.com> wrote: > Hey Dylan, > Great! > Can you revert back to my initial…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
…est mail? Thanks, Aakash. On 15-Mar-2018 12:27 AM, "Dylan Guedes" <djmggue...@gmail.com> wrote: > Hi,…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
…t want to know: when does Spark 2.3 with the 0.10 Kafka Spark package allow Python? I read somewhere that, as of now, Scala and Java are the languages to be used. Please correct me if I am wrong.…

Re: How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Felix Cheung
…ark@gmail.com> Sent: Wednesday, March 14, 2018 1:09 AM Subject: How to start practicing Python Spark Streaming in Linux? To: user <user@spark.apache.org> Hi all, Any guide on how to kick-start learning PySpark Streaming in an Ubuntu standalone system? Step-wise, practical hands-on would be great…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Tathagata Das
…Aakash. On 14-Mar-2018 8:24 PM, "Georg Heiler" <georg.kf.hei...@gmail.com> wrote: > Did you try Spark 2.3 with structured streaming? The watermarking > and plain SQL might be r…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
…> Please correct me if am wrong. > Thanks, > Aakash. > On 14-Mar-2018 8:24 PM, "Georg Heiler" <georg.kf.hei...@gmail.com> wrote: > Did you try Spark 2.3 with structured streami…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
…watermarking and plain SQL might be really interesting for you. Aakash Basu <aakash.spark@gmail.com> wrote on Wed, 14 Mar 2018 at 14:57: > Hi, > *Info (Using): Spark Str…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
…watermarking and plain SQL might be really interesting for you. Aakash Basu <aakash.spark@gmail.com> wrote on Wed, 14 Mar 2018 at 14:57: > Hi, > *Info (Using): Spark Streaming Kafka 0.8 pac…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
…il.com> wrote: > Did you try Spark 2.3 with structured streaming? The watermarking and > plain SQL might be really interesting for you. > Aakash Basu <aakash.spark@gmail.com> wrote on Wed, 14 Mar 2018 at 14:57: > Hi,…

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Georg Heiler
Did you try Spark 2.3 with structured streaming? The watermarking and plain SQL might be really interesting for you. Aakash Basu <aakash.spark@gmail.com> wrote on Wed, 14 Mar 2018 at 14:57: > Hi, > *Info (Using): Spark Streaming Kafka 0.8 package* > *…
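
A sketch of what that suggestion looks like in Spark 2.3 structured streaming, with two Kafka topics joined under event-time watermarks (servers, topics, and column names are illustrative):

    import org.apache.spark.sql.functions.expr

    val left = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "topicA")
      .load()
      .selectExpr("CAST(value AS STRING) AS a_val", "timestamp AS a_ts")
      .withWatermark("a_ts", "10 minutes")

    val right = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "topicB")
      .load()
      .selectExpr("CAST(value AS STRING) AS b_val", "timestamp AS b_ts")
      .withWatermark("b_ts", "10 minutes")

    // SQL join semantics with a bounded event-time range
    val joined = left.join(
      right,
      expr("a_val = b_val AND b_ts BETWEEN a_ts AND a_ts + interval 1 hour"))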

Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi, *Info (Using): Spark Streaming Kafka 0.8 package* *Spark 2.2.1* *Kafka 1.0.1* As of now, I am feeding paragraphs into the Kafka console producer, and my Spark job, which is acting as a receiver, is printing the flattened words, which is a complete RDD operation. *My motive is to read two tables…

How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Aakash Basu
Hi all, Any guide on how to kick-start learning PySpark Streaming in an Ubuntu standalone system? Step-wise, practical hands-on would be great. Also, connecting Kafka with Spark and getting real-time data and processing it in micro-batches... Any help? Thanks, Aakash.

Spark Streaming logging on Yarn : issue with rolling in yarn-client mode for driver log

2018-03-07 Thread chandan prakash
Hi All, I am running my Spark Streaming job in yarn-client mode. I want to enable rolling and aggregation in the node manager container. I am using the configs suggested in the Spark docs <https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application>: ${spark.yarn.app.container.l…

Spark Streaming reading many topics with Avro

2018-03-02 Thread Guillermo Ortiz
Hello, I want to read several topics with a single Spark Streaming process. I'm using Avro, and the data in the different topics have different schemas. Ideally, if I had only one topic I could implement a deserializer, but I don't know if it's possible with many different schemas. val…
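
Kafka's Deserializer interface receives the topic name on every call, so a single deserializer can dispatch on it; a sketch with a hard-coded schema map (a schema registry is the usual production answer):

    import java.util
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
    import org.apache.avro.io.DecoderFactory
    import org.apache.kafka.common.serialization.Deserializer

    class PerTopicAvroDeserializer(schemas: Map[String, Schema])
        extends Deserializer[GenericRecord] {
      override def configure(configs: util.Map[String, _], isKey: Boolean): Unit = ()
      override def deserialize(topic: String, data: Array[Byte]): GenericRecord = {
        // pick the schema registered for this topic
        val reader = new GenericDatumReader[GenericRecord](schemas(topic))
        reader.read(null, DecoderFactory.get().binaryDecoder(data, null))
      }
      override def close(): Unit = ()
    }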

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-23 Thread vijay.bvp
Thanks for adding the RDD lineage graph. I could see 18 parallel tasks for the HDFS read; was it changed? What is the Spark job configuration, i.e., how many executors and cores per executor? I would say keep the partitioning a multiple of (number of executors * cores) for all the RDDs. If you have 3 executors…
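
Concretely, the advice above amounts to something like this (executor counts and the input RDD are illustrative):

    val executors = 3
    val coresPerExecutor = 3
    // keep partition counts a multiple of the total cores: 9, 18, 27, ...
    val balanced = inputRDD.repartition(executors * coresPerExecutor * 2)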

Re: Consuming Data in Parallel using Spark Streaming

2018-02-22 Thread naresh Goud
…entiate records of one type of entity from other types of entities. -Beejal From: naresh Goud [mailto:nareshgoud.du...@gmail.com] Sent: Friday, February 23, 2018 8:56 AM To: Vibhakar, Beejal <beejal.vibha...@fisglobal.com>…

RE: Consuming Data in Parallel using Spark Streaming

2018-02-22 Thread Vibhakar, Beejal
…records of one type of entity from other types of entities. -Beejal From: naresh Goud [mailto:nareshgoud.du...@gmail.com] Sent: Friday, February 23, 2018 8:56 AM To: Vibhakar, Beejal <beejal.vibha...@fisglobal.com> Subject: Re: Consuming Data in Parallel using Spark Streaming You will have th…

Consuming Data in Parallel using Spark Streaming

2018-02-21 Thread Vibhakar, Beejal
I am trying to process data from 3 different Kafka topics using 3 InputDStreams with a single StreamingContext. I am currently testing this in a sandbox, where I see data processed from one Kafka topic followed by the others. Question #1: I want to understand, when I run this program in Hadoop…

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-20 Thread LongVehicle
Hi Vijay, Thanks for the follow-up. The reason why we have 90 HDFS files (causing a parallelism of 90 for the HDFS read stage) is that we load the same HDFS data in different jobs, and these jobs have parallelisms (executors x cores) of 9, 18, and 30. The uneven assignment problem that we had…

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread vijay.bvp
Apologies for the long answer. Understanding partitioning at each stage of the RDD graph/lineage is important for efficient parallelism and balanced load. This applies to working with any sources, streaming or static. You have a tricky situation here: one source, Kafka, with 9…

Re: [Spark Streaming]: Non-deterministic uneven task-to-machine assignment

2018-02-19 Thread Aleksandar Vitorovic
Hi Vijay, Thank you very much for your reply. Setting the number of partitions explicitly in the join, and the influence of memory pressure on partitioning, were definitely very good insights. In the end, we avoided the issue of uneven load balancing completely by doing the following two things: a) Reducing the…

Re: Spark Streaming withWatermark

2018-02-06 Thread Tathagata Das
That may very well be possible. The watermark delay guarantees that any record newer than or equal to the watermark (that is, max event time seen - 20 seconds) will be considered and never be ignored. It does not guarantee the other way; that is, it does NOT guarantee that records older than the…

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Could it be that these messages were processed in the same micro-batch? In that case, the watermark will be updated only after the batch finishes, which means the late data in the current batch was unaffected. On Tue, Feb 6, 2018 at 4:18 PM Jiewen Shao wrote: > Ok,…

Re: Spark Streaming withWatermark

2018-02-06 Thread Jiewen Shao
Ok, thanks for the confirmation. So based on my code, I have messages with the following timestamps (converted to a more readable format) in the following order: 2018-02-06 12:00:00 2018-02-06 12:00:01 2018-02-06 12:00:02 2018-02-06 12:00:03 2018-02-06 11:59:00 <-- this message should not be counted,…

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Yes, that is correct. On Tue, Feb 6, 2018 at 4:56 PM, Jiewen Shao wrote: > Vishnu, thanks for the reply. > So "event time" and "window end time" have nothing to do with the current > system timestamp; the watermark moves with the highest value of the "timestamp" > field of the input…

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Hi, The 20 seconds corresponds to when the window state should be cleared. For a late message to be dropped, it should come in after you receive a message with event time >= window end time + 20 seconds. I wrote a post on this recently:

Spark Streaming withWatermark

2018-02-06 Thread Jiewen Shao
Sample code: let's say Xyz is a POJO with a field called timestamp. Regarding the code withWatermark("timestamp", "20 seconds"), I expect messages with timestamps 20 seconds or older to be dropped. What does the 20 seconds compare to? Based on my test, nothing was dropped no matter how old the timestamp…
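
For reference, a sketch of the setup under discussion (events is assumed to be a streaming Dataset with an event-time column named timestamp). Per the replies above, a record is only dropped once a newer event has pushed the watermark, i.e. max event time seen minus 20 seconds, past its window:

    import org.apache.spark.sql.functions.{col, window}

    val counts = events
      .withWatermark("timestamp", "20 seconds")
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()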

Re: Prefer Structured Streaming over Spark Streaming (DStreams)?

2018-02-02 Thread Biplob Biswas
Great to hear 2 different viewpoints, and thanks a lot for your input, Michael. For now, our application performs an ETL process where it reads data from Kafka, stores it in HBase, then performs basic enhancement and pushes data out on a Kafka topic. We have a conflict of opinion here as a few…
