If TypedColumn is a subclass of Column, why can't I apply a function to it in a Dataset?

2017-03-18 Thread Yong Zhang
In the following example, after I use "typed.avg" to generate a TypedColumn, I want to apply round on top of it. Why does Spark complain about it? Is it because it no longer knows that it is a TypedColumn? Thanks, Yong scala> spark.version res20: String = 2.1.0 scala> case
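One way to sketch the issue (the Sale case class below is a placeholder, not from the original post): round() is defined on Column and returns a plain Column, so the typed agg API no longer accepts it directly, but .as[Double] turns the result back into a TypedColumn.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.round
import org.apache.spark.sql.expressions.scalalang.typed

case class Sale(item: String, amount: Double)

val spark = SparkSession.builder().appName("typed-avg-round").getOrCreate()
import spark.implicits._

val ds = Seq(Sale("a", 1.25), Sale("a", 2.75), Sale("b", 3.10)).toDS()

// typed.avg yields a TypedColumn[Sale, Double], which groupByKey(...).agg accepts.
val avgCol = typed.avg[Sale](_.amount)

// round(...) returns a plain Column; converting back with .as[Double]
// restores a TypedColumn that agg will accept again.
val rounded = round(avgCol, 1).as[Double]

ds.groupByKey(_.item).agg(rounded).show()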

Re: Spark Streaming from Kafka, dealing with initial heavy load.

2017-03-18 Thread Mal Edwin
Hi, you can enable backpressure to handle this: spark.streaming.backpressure.enabled, spark.streaming.receiver.maxRate. Thanks, Edwin On Mar 18, 2017, 12:53 AM -0400, sagarcasual wrote: > Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using direct >
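A sketch of setting these on the SparkConf (the values are placeholders; since the quoted job uses a direct stream, the per-partition cap spark.streaming.kafka.maxRatePerPartition is added here as the direct-stream analogue of receiver.maxRate):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Enable backpressure so the ingest rate adapts to processing delay,
// and cap the rate while the initial backlog is drained.
val conf = new SparkConf()
  .setAppName("kafka-streaming")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.streaming.receiver.maxRate", "1000")          // receiver-based streams
  .set("spark.streaming.kafka.maxRatePerPartition", "1000") // direct streams, per partition

val ssc = new StreamingContext(conf, Seconds(5))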

Re: Spark Streaming executor memory increasing and executor killed by YARN

2017-03-18 Thread Bill Schwanitz
I have had similar issues with some of my Spark jobs, especially when doing things like repartitioning. spark.yarn.driver.memoryOverhead (default: driverMemory * 0.10, with minimum of 384): the amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for
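A sketch of raising the overhead explicitly (values are placeholders; spark.yarn.executor.memoryOverhead is included here because the thread is about executors being killed, and it follows the same default formula):

import org.apache.spark.SparkConf

// Raise the off-heap overhead instead of relying on the 10% default,
// which is often too small for shuffle-heavy repartition jobs.
val conf = new SparkConf()
  .set("spark.yarn.driver.memoryOverhead", "1024")   // MB, driver in cluster mode
  .set("spark.yarn.executor.memoryOverhead", "2048") // MB, per executor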

Re: [Spark SQL & Core]: RDD to Dataset with 1500 columns via createDataFrame() throws a "grows beyond 64 KB" exception

2017-03-18 Thread Kazuaki Ishizaki
Hi, here is the latest status of code generation. When we use the master branch (which will become Spark 2.2), the following exception occurs. The latest version fixes the 64KB errors in this case; however, we hit another JVM limit, the number of constant pool entries. Caused by:

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-18 Thread Dominik Safaric
If I were to set the window duration to 60 seconds, with a batch interval of one second and a slide duration of 59 seconds, I would get the desired behaviour. However, would the Receiver pull messages from Kafka only at the 59-second slide interval, or would it constantly pull the
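A minimal sketch of that setup (a Kafka 0.10 direct stream is assumed; broker address, topic, and group id are placeholders). The window and slide only control when the windowed DStream emits results; records are fetched once per batch interval regardless:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

// 1-second batches, 60-second window sliding every 59 seconds.
val conf = new SparkConf().setAppName("window-vs-batch")
val ssc = new StreamingContext(conf, Seconds(1))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "window-demo")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

val windowed = stream.map(_.value).window(Seconds(60), Seconds(59))
windowed.foreachRDD(rdd => println(s"records in window: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()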

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-18 Thread Dominik Safaric
Correct - that is the part I understood. However, what alternative transformation might I apply to iterate through the RDDs, considering a window duration of 60 seconds which I cannot change? > On 17 Mar 2017, at 16:57, Cody Koeninger wrote: > > Probably
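One option, sketched against the stream variable from the snippet above (an assumption), is a tumbling 60-second window, i.e. a slide equal to the window duration, with foreachRDD to iterate over each window's contents:

// Slide == window, so each record appears in exactly one window;
// foreachRDD then iterates over the RDD produced for that window.
stream.map(_.value)
  .window(Seconds(60), Seconds(60))
  .foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach(println)
    }
  }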

[Spark SQL & Core]: RDD to Dataset with 1500 columns via createDataFrame() throws a "grows beyond 64 KB" exception

2017-03-18 Thread elevy
Hello all, I am using the Spark 2.1.0 release and I am trying to load a BigTable CSV file with more than 1500 columns into our system. Our flow for doing that is: • First, read the data as an RDD • Generate a continuous record id using zipWithIndex() (this operation exists only in the RDD API,
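A rough sketch of that flow (the file path, column count, and all-string column types are placeholders):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("wide-csv").getOrCreate()

// Read the raw CSV as an RDD of lines (header and quoting handling omitted).
val lines = spark.sparkContext.textFile("/path/to/bigtable.csv")

// zipWithIndex exists only in the RDD API; use it to attach a record id.
val withId = lines.map(_.split(",", -1)).zipWithIndex()

// Build the wide schema programmatically: an id column plus N string columns.
val nCols = 1500
val schema = StructType(
  StructField("record_id", LongType, nullable = false) +:
  (0 until nCols).map(i => StructField(s"c$i", StringType, nullable = true)))

val rows = withId.map { case (fields, id) => Row.fromSeq(id +: fields.toSeq) }

// With ~1500 columns this is where generated code can hit JVM limits
// (64KB method size, and beyond that the constant pool entry count).
val df = spark.createDataFrame(rows, schema)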

SparkStreaming getActiveOrCreate

2017-03-18 Thread Justin Pihony
The docs on getActiveOrCreate make it seem that you'll get an already started context: > Either return the "active" StreamingContext (that is, started but not > stopped), or create a new StreamingContext that is However, as far as I can tell from the code, it is strictly dependent on the
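For reference, a minimal usage sketch (the creating function and app name are placeholders); whatever getActiveOrCreate returns, start() still needs to be called in case a new, not-yet-started context was created:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("get-active-or-create")
  val ssc = new StreamingContext(conf, Seconds(10))
  // ... define input streams and output operations here ...
  ssc
}

// Returns the active (started, not stopped) context if one exists;
// otherwise it invokes createContext() and returns the new, unstarted context.
val ssc = StreamingContext.getActiveOrCreate(createContext _)
ssc.start()
ssc.awaitTermination()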