Re: [Spark Structured Streaming on K8S]: Debug - File handles/descriptor (unix pipe) leaking

2018-07-25 Thread Yuval.Itzchakov
We're experiencing the exact same issue while running load tests on Spark 2.3.1 with Structured Streaming and `mapGroupsWithState`.

Re: Does mapWithState need checkpointing to be specified in Spark Streaming?

2017-07-16 Thread Yuval.Itzchakov
Yes, you do.
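For completeness, a minimal sketch of the setup (the checkpoint path and the state function are placeholders, not from the original thread). `mapWithState` persists its state via checkpointed RDDs, so the StreamingContext must have a checkpoint directory before the stream starts:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val conf = new SparkConf().setAppName("MapWithStateExample")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Required: mapWithState stores its state through the checkpoint mechanism.
    ssc.checkpoint("hdfs:///tmp/checkpoints") // placeholder path

    // A running sum per key, purely for illustration.
    val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
      val sum = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(sum)
      (key, sum)
    }
    // pairs: DStream[(String, Int)] from any source
    // val sums = pairs.mapWithState(spec)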

Re: How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-07-03 Thread Yuval.Itzchakov
Using a long period between checkpoints may cause a long lineage of RDD computations to build up, since Spark uses checkpointing to cut that lineage; a long lineage can itself delay the streaming job.
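A sketch of where that knob lives (the interval is illustrative; the docs suggest 5-10x the batch interval as a starting point):

    import org.apache.spark.streaming.Seconds

    // pairs: DStream[(String, Int)] and spec: a StateSpec - both assumed defined.
    // With a 10s batch interval, checkpointing every 5th batch bounds the
    // lineage without paying the checkpoint write cost on every batch.
    val stateStream = pairs.mapWithState(spec)
    stateStream.checkpoint(Seconds(50))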

Structured Streaming UI similar to Spark Streaming

2017-07-02 Thread Yuval.Itzchakov
Hi, Today Spark Streaming exposes extensive, detailed graphs of input rate, processing time and delay. I was wondering: is there any plan to integrate such graphs for Structured Streaming? Now with Kafka support and the implementation of stateful aggregations in Spark 2.2, it's becoming a very

Re: How to reduce the amount of data that is getting written to the checkpoint from Spark Streaming

2017-07-02 Thread Yuval.Itzchakov
You can't. Spark doesn't let you fiddle with the data being checkpointed, as it's an internal implementation detail.

Stateful aggregations with Structured Streaming

2016-11-19 Thread Yuval.Itzchakov
I've been using `DStream.mapWithState` and was looking forward to trying out Structured Streaming. The thing I can't figure out is: does Structured Streaming, in its current state, support stateful aggregations? Looking at the StateStore design document
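To make the question concrete, this is the kind of stateful aggregation I mean (a sketch against the 2.0 Dataset API; the socket source and console sink are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StatefulAgg").getOrCreate()
    import spark.implicits._

    // A running word count: the per-key counts are state that must survive
    // across micro-batches (this is what the StateStore manages internally).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()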

Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Yuval.Itzchakov
I've been reading and watching videos about the upcoming Spark 2.0 release, which brings us Structured Streaming. One thing I've yet to understand is how this relates to the current way of working with streaming in Spark via the DStream abstraction. All examples I can find, in the Spark

Evicting a lower version of a library loaded in Spark Worker

2016-04-03 Thread Yuval.Itzchakov
My code uses "com.typesafe.config" in order to read configuration values. Currently our code uses v1.3.0, whereas Spark uses v1.2.1 internally. When I initiate a job, the worker process invokes a method in my code but fails, because that method is declared abstract in v1.2.1 whereas in v1.3.0 it is not. The
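One way out of this class of conflict, for anyone who hits it later: shade the library inside the application jar so Spark's older copy can never win the classloader race. A sketch with sbt-assembly (the shaded package name is just an example):

    // build.sbt - rename our copy of typesafe-config into a private namespace
    // so the worker's classloader can't resolve it to Spark's bundled v1.2.1.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.typesafe.config.**" -> "shaded.typesafe.config.@1").inAll
    )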

Re: value saveToCassandra is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2016-04-03 Thread Yuval.Itzchakov
You need to import com.datastax.spark.connector.streaming._ to have the methods available. https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md
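In context, that looks something like this (keyspace, table and column names are placeholders):

    import com.datastax.spark.connector._           // SomeColumns
    import com.datastax.spark.connector.streaming._ // adds saveToCassandra to DStream

    // wordCounts: DStream[(String, Int)], assumed built earlier in the job
    wordCounts.saveToCassandra("my_keyspace", "word_counts", SomeColumns("word", "count"))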

Setting up log4j2/logback with Spark 1.6.0

2016-03-19 Thread Yuval.Itzchakov
I've been trying to get log4j2 and logback to play nicely with Spark 1.6.0 so I can properly offload my logs to a remote server. I've attempted the following things:
1. Setting logback/log4j2 on the class path for both the driver and worker nodes
2. Passing -Dlog4j.configurationFile= and

Using dynamic allocation and shuffle service in Standalone Mode

2016-03-08 Thread Yuval.Itzchakov
Hi, I'm using Spark 1.6.0, and according to the documentation, dynamic allocation and the Spark shuffle service should be enabled. When I submit a Spark job via the following:

    spark-submit \
      --master  \
      --deploy-mode cluster \
      --executor-cores 3 \
      --conf
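For reference, the two settings that have to be on together, sketched via SparkConf (in standalone mode the Worker hosts the external shuffle service when the same flag is set in its own configuration):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("DynamicAllocationExample")
      // Both must be enabled: dynamic allocation removes executors, and the
      // external shuffle service keeps their shuffle files readable afterwards.
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")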

Re: Spark Streaming fileStream vs textFileStream

2016-03-06 Thread Yuval.Itzchakov
I don't think the documentation can be any more descriptive:

    /**
     * Create an input stream that monitors a Hadoop-compatible filesystem
     * for new files and reads them using the given key-value types and input format.
     * Files must be written to the monitored directory by "moving" them from
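The practical difference, sketched (types and path are illustrative; ssc is an existing StreamingContext): textFileStream is just fileStream specialized to text input.

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // fileStream lets you pick the key/value/InputFormat types yourself...
    val typed = ssc.fileStream[LongWritable, Text, TextInputFormat]("hdfs:///incoming")
    val lines = typed.map(_._2.toString)
    // ...whereas ssc.textFileStream("hdfs:///incoming") gives DStream[String] directly.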

Continuous deployment to Spark Streaming application with sessionization

2016-03-06 Thread Yuval.Itzchakov
I've recently been thinking about continuous deployment for our Spark Streaming service. We have a streaming application which does sessionization via `mapWithState`, aggregating sessions in memory until they are ready to be deployed. Now, as I see things, we have two use cases here: 1. Spark

Using a non-serializable third-party JSON serializer on a Spark worker node throws NotSerializableException

2016-03-01 Thread Yuval.Itzchakov
I have a small snippet of code which relies on argonaut for JSON serialization and which is run from a `PairDStreamFunctions.mapWithState` callback once a session is completed. This is the code snippet (not that important): override def sendMessage(pageView: PageView): Unit = {
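The usual way around this, for the archives: keep the non-serializable machinery out of the shipped closure. A sketch (the PageView fields and the codec are made up):

    import argonaut._, Argonaut._

    case class PageView(url: String, userId: String) // hypothetical fields

    // Held in an object, the codec is initialized lazily on each executor's
    // JVM, so the non-serializable Argonaut internals never travel in a closure
    // from the driver.
    object PageViewCodec {
      implicit val codec: CodecJson[PageView] =
        casecodec2(PageView.apply, PageView.unapply)("url", "userId")
    }

    def sendMessage(pageView: PageView): Unit = {
      import PageViewCodec._
      val json = pageView.asJson.nospaces
      // ... hand json to the downstream system
    }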

Re: Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-14 Thread Yuval.Itzchakov
Your question lacks sufficient information for us to actually provide help. Have you looked at the Spark UI to see which part of the graph is taking the longest? Have you tried logging your methods?

Re: How can I write a map(x => x._1 ++ x._2) statement in Python?

2016-02-08 Thread Yuval.Itzchakov
In Python, concatenating two lists can be done simply with the + operator. I'm assuming the RDD you're mapping over consists of tuples: map(lambda x: x[0] + x[1])

Re: Apache Spark data locality when integrating with Kafka

2016-02-07 Thread Yuval.Itzchakov
I would definitely try to avoid hosting Kafka and Spark on the same servers. Kafka and Spark will be doing a lot of IO between them, so you'll want to maximize those resources and not share them on the same server. You'll want each Kafka broker to be on a dedicated server, as well as your

PairDStreamFunctions.mapWithState fails when a timeout is set without updating State[S]

2016-02-04 Thread Yuval.Itzchakov
Hi, I've been playing with the experimental PairDStreamFunctions.mapWithState feature and I seem to have stumbled across a bug; I was wondering if anyone else has been seeing this behavior. I've opened up an issue in the Spark JIRA; I simply want to pass this along in case anyone else is
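The shape of the failing case, sketched from the subject line (types and values are illustrative; the JIRA issue has the real repro):

    import org.apache.spark.streaming.{Seconds, State, StateSpec}

    // A StateSpec with a timeout whose mapping function never calls
    // state.update - the combination described in the subject.
    val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
      if (state.isTimingOut()) {
        // key is expiring; the state can no longer be updated here
      }
      (key, value.getOrElse(0)) // note: state.update(...) is never invoked
    }.timeout(Seconds(30))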