Re: Inconsistency for nullvalue handling CSV: see SPARK-16462, SPARK-16460, SPARK-15144, SPARK-17290 and SPARK-16903

2016-08-29 Thread Nicholas Chammas
I wish JIRA would automatically show you potentially similar issues as you are typing up a new one, like Stack Overflow does... It would really help cut down on duplicate reports. On Mon, Aug 29, 2016 at 10:55 PM Hyukjin Kwon wrote: > Hi all, > > > PR: >

Inconsistency for nullvalue handling CSV: see SPARK-16462, SPARK-16460, SPARK-15144, SPARK-17290 and SPARK-16903

2016-08-29 Thread Hyukjin Kwon
Hi all, PR: https://github.com/apache/spark/pull/14118 JIRAs: https://issues.apache.org/jira/browse/SPARK-17290 https://issues.apache.org/jira/browse/SPARK-16903 https://issues.apache.org/jira/browse/SPARK-16462 https://issues.apache.org/jira/browse/SPARK-16460

Re: Real time streaming in Spark

2016-08-29 Thread Luciano Resende
There were some prototypes and discussions built on top of Spark Streaming, and they were considering how that would fit with Structured Streaming, which was in design mode at that time. See https://issues.apache.org/jira/browse/SPARK-14745 for some details and a link to the PR. On Mon, Aug

Saving less data to improve Pregel performance in GraphX?

2016-08-29 Thread Fang Zhang
Dear developers, I am running some tests using the Pregel API. It seems to me that more than 90% of the volume of a graph object is composed of index structures that do not change during the execution of Pregel. When a graph is too large to fit in memory, Pregel will persist

KMeans calls takeSample() twice?

2016-08-29 Thread gsamaras
After reading the internal Spark code for KMeans, I wasn't able to understand why it calls takeSample() twice. Can someone please explain? There is a relevant StackOverflow question.

Real time streaming in Spark

2016-08-29 Thread Tomasz Gawęda
Hi everyone, I wonder if there are plans to implement real time streaming in Spark. I see that in Spark 2.0 Trigger can have more implementations than ProcessingTime. In my opinion real time streaming (that is, reacting to every event, like continuous queries in Apache Ignite) will be very useful

Re: Performance of loading parquet files into case classes in Spark

2016-08-29 Thread Julien Dumazert
Hi Maciek, I followed your recommendation and benchmarked DataFrame aggregations on Dataset. Here is what I got: // Dataset with RDD-style code // 34.223s df.as[A].map(_.fieldToSum).reduce(_ + _) // Dataset with map and Dataframes sum // 35.372s

Re: Structured Streaming with Kafka sources/sinks

2016-08-29 Thread Fred Reiss
I think that the community really needs some feedback on the progress of this very important task. Many existing Spark Streaming applications can't be ported to Structured Streaming without Kafka support. Is there a design document somewhere? Or can someone from the Databricks team break down

Re: spark roadmap

2016-08-29 Thread Mark Hamstra
At this point, there is no target date set for 2.1. That's something that we should do fairly soon, but right now there is at least a little room for discussion as to whether we want to continue with the same pace of releases that we targeted throughout the 1.x development cycles, or whether

[build system] jenkins wedged itself this weekend, just restarted

2016-08-29 Thread shane knapp
jenkins got into one of its "states" and wasn't accepting new builds starting this past saturday night. i restarted it, and now it's catching up on the weekend's queue. shane

Re: Broadcast Variable Life Cycle

2016-08-29 Thread Sean Owen
Yes, you want to actively unpersist() or destroy() broadcast variables when they're no longer needed. They can eventually be removed when the reference on the driver is garbage collected, but you usually would not want to rely on that. On Mon, Aug 29, 2016 at 4:30 PM, Jerry Lam
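
The advice above can be illustrated with a minimal sketch. It assumes Spark 2.0 or later (where Broadcast.destroy() is public) and a local master; the object name BroadcastLifecycle, the helper run, and the sample data are hypothetical and not taken from the thread. Each loop iteration creates its own broadcast and releases it explicitly once the job that used it has finished, rather than waiting for driver-side garbage collection.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not code from the original thread.
object BroadcastLifecycle {

  // Runs one small job per iteration. Every iteration broadcasts a new value
  // and releases it as soon as the action (reduce) has completed.
  def run(sc: SparkContext): Seq[Int] = {
    val data = sc.parallelize(1 to 100)
    (1 to 10).map { i =>
      val bc = sc.broadcast(i)
      try {
        // reduce is an action, so the job is finished when it returns
        data.map(_ * bc.value).reduce(_ + _)
      } finally {
        // destroy() removes the broadcast data on the driver and executors;
        // the variable must not be used afterwards. Use unpersist() instead
        // if the same broadcast may be needed again later.
        bc.destroy()
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("broadcast-lifecycle").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The try/finally is a design choice of this sketch: it releases the broadcast even when the job fails, which matters in long-running loops where many broadcasts would otherwise accumulate.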

Remaining folders in .sparkStaging directory after app was killed

2016-08-29 Thread Artur Sukhenko
Hello spark devs, Whenever I run a spark app in yarn-cluster mode, press Ctrl+C to stop spark-submit, and then run yarn application -kill, folders remain in hdfs: .sparkStaging/application_1472140614688_0001 .sparkStaging/application_1472140614688_0002 Will those folders never be deleted? And if so,

Re: Broadcast Variable Life Cycle

2016-08-29 Thread Jerry Lam
Hello spark developers, Can anyone shed some light on the life cycle of broadcast variables? Basically, I have a broadcast variable defined in a loop, and for each iteration I provide a different value. // For example: for (i <- 1 to 10) { val bc = sc.broadcast(i)

Re: Spark 2.0 and Yarn

2016-08-29 Thread Saisai Shao
This archive contains all the jars required by the Spark runtime. You could zip all the jars under /jars and upload this archive to HDFS, then configure spark.yarn.archive with the path of this archive on HDFS. On Sun, Aug 28, 2016 at 9:59 PM, Srikanth Sampath wrote: >

spark roadmap

2016-08-29 Thread Denis Bolshakov
Hello spark devs, Can anyone provide a roadmap for the next two months? Or at least say when we can expect the 2.1 release and which features will be added?