pyspark on windows (Upgrade from 1.6 to 2.0.2): sqlContext.read.format fails

2017-08-25 Thread Shiv Onkar Kumar
Hi, The following line works very well in 1.6 although fails in2.0.2. Any idea, what could be the issue   df_train =sqlContext.read.format("libsvm").load("../data/mllib/sample_linear_regression_data.txt")   The error is    File "", line 1, in     df_train =

Why do checkpoints work the way they do?

2017-08-25 Thread Hugo Reinwald
Hello, I am implementing a spark streaming solution with Kafka and read that checkpoints cannot be used across application code changes - here I tested changes in application code and got the error message as b below -

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread swetha kasireddy
Because I saw some posts that say that consumer cache enabled will have concurrentModification exception with reduceByKeyAndWIndow. I see those errors as well after running for sometime with cache being enabled. So, I had to disable it. Please see the tickets below. We have 96 partitions. So if

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread swetha kasireddy
Because I saw some posts that say that consumer cache enabled will have concurrentModification exception with reduceByKeyAndWIndow. I see those errors as well after running for sometime with cache being enabled. So, I had to disable it. Please see the tickets below. We have 96 partitions. So if

Re: [Streaming][Structured Streaming] Understanding dynamic allocation in streaming jobs

2017-08-25 Thread Karthik Palaniappan
I definitely agree that dynamic allocation is useful, that's why I asked the question :p More specifically, does spark plan to solve the problems with DRA for structured streaming mentioned in that Cloudera article? If folks can give me pointers on where to start, I'd be happy to implement

Re: Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread Cody Koeninger
Why are you setting consumer.cache.enabled to false? On Fri, Aug 25, 2017 at 2:19 PM, SRK wrote: > Hi, > > What would be the appropriate settings to run Spark with Kafka 10? My job > works fine with Spark with Kafka 8 and with Kafka 8 cluster. But its very > slow with

Slower performance while running Spark Kafka Direct Streaming with Kafka 10 cluster

2017-08-25 Thread SRK
Hi, What would be the appropriate settings to run Spark with Kafka 10? My job works fine with Spark with Kafka 8 and with Kafka 8 cluster. But its very slow with Kafka 10 by using Kafka Direct' experimental APIs for Kafka 10 . I see the following error sometimes . Please see the kafka parameters

RE: [Spark Streaming] Streaming Dynamic Allocation is broken (at least on YARN)

2017-08-25 Thread Karthik Palaniappan
You have to set spark.executor.instances=0 in a streaming application with dynamic allocation: https://github.com/tdas/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207. I originally had it set to a positive value, and

Re: Structured Streaming: multiple sinks

2017-08-25 Thread Tathagata Das
My apologies Chris. Somehow I have not received the first email by OP, and hence thought our answers to OP as cryptic questions. :/ I found the full thread on nabble. I agree with your analysis of OP's question 1. On Fri, Aug 25, 2017 at 12:48 AM, Chris Bowden wrote: >

Re: Structured Streaming: multiple sinks

2017-08-25 Thread Chris Bowden
Tathagata, thanks for filling in context for other readers on 2a and 2b, I summarized too much in hindsight. Regarding the OP's first question, I was hinting it is quite natural to chain processes via kafka. If you are already interested in writing processed data to kafka, why add complexity to a

Re: Structured Streaming: multiple sinks

2017-08-25 Thread Tathagata Das
Responses inline. On Thu, Aug 24, 2017 at 7:16 PM, cbowden wrote: > 1. would it not be more natural to write processed to kafka and sink > processed from kafka to s3? > I am sorry i dont fully understand this question. Could you please elaborate further, as in, what is