Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-21 Thread Sean Owen
While I'd officially -1 this as long as there are still many blockers, it should certainly be tested as usual, because they're mostly doc and "audit"-type issues. On Wed, Jun 22, 2016 at 2:26 AM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache

Re: Structured Streaming partition logic with respect to storage and fileformat

2016-06-21 Thread Sachin Aggarwal
What will the scenario be in the case of S3 and the local file system? On Tue, Jun 21, 2016 at 4:36 PM, Jörn Franke wrote: > Based on the underlying Hadoop FileFormat. This one does it mostly based > on block size. You can change this, though. > > On 21 Jun 2016, at 12:19, Sachin

Re: Question about Bloom Filter in Spark 2.0

2016-06-21 Thread Reynold Xin
SPARK-12818 is about building a bloom filter on existing data. It has nothing to do with the ORC bloom filter, which can be used to do predicate pushdown. On Tue, Jun 21, 2016 at 7:45 PM, BaiRan wrote: > Hi all, > > I have a question about bloom filter implementation in
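
A minimal sketch of the SPARK-12818 side, assuming the DataFrame stat API ships as df.stat.bloomFilter in 2.0; the DataFrame and column names below are illustrative only:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bloom-filter-sketch").getOrCreate()

    // Illustrative DataFrame: a single "id" column with a million rows.
    val users = spark.range(0L, 1000000L).toDF("id")

    // Build a Bloom filter over the existing data (the SPARK-12818 API):
    // expected item count and target false-positive probability.
    val bf = users.stat.bloomFilter("id", 1000000L, 0.03)

    // Probabilistic membership: false means definitely absent,
    // true means probably present.
    println(bf.mightContain(42L))   // expected: true
    println(bf.mightContain(-1L))   // never inserted, so false modulo false positives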

Question about Bloom Filter in Spark 2.0

2016-06-21 Thread BaiRan
Hi all, I have a question about the bloom filter implementation in the SPARK-12818 issue. If I have an ORC file with bloom filter metadata, how can I utilise it from Spark SQL? Thanks. Best, Ran
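
A rough sketch of the ORC side of the question. The writer options below use the Hive/ORC property names (orc.bloom.filter.columns, orc.bloom.filter.fpp); whether a given Spark build forwards them to the ORC writer should be verified, so treat them as assumptions. The read-side setting spark.sql.orc.filterPushdown enables predicate pushdown against ORC metadata:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-bloom-sketch").getOrCreate()

    // Write ORC and ask the writer to embed bloom filter metadata for "id".
    // (Property names are the Hive/ORC writer ones; the ORC source itself
    // requires a Spark build with Hive support.)
    spark.range(0L, 1000000L).toDF("id")
      .write
      .option("orc.bloom.filter.columns", "id")
      .option("orc.bloom.filter.fpp", "0.05")
      .orc("/tmp/orc_with_bloom")

    // Read with ORC predicate pushdown enabled, so equality/range filters
    // can be checked against stripe metadata (including bloom filters)
    // before rows are materialised.
    spark.conf.set("spark.sql.orc.filterPushdown", "true")
    spark.read.orc("/tmp/orc_with_bloom").where("id = 42").show()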

[VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-21 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 2.0.0. The vote is open until Friday, June 24, 2016 at 19:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.0.0 [ ] -1 Do not release this package because ...

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Shixiong(Ryan) Zhu
Hey Pete, I just pushed your PR to branch 1.6. As it's not a blocker, it may or may not be in 1.6.2, depending on whether there will be another RC. On Tue, Jun 21, 2016 at 1:36 PM, Pete Robbins wrote: > It breaks Spark running on machines with less than 3 cores/threads, which >

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Pete Robbins
It breaks Spark running on machines with less than 3 cores/threads, which may be rare, and it is arguably an edge case. Personally, I like to fix known bugs, and the fact that there are other blocking methods in event loops actually makes it worse not to fix the ones that you know about. Probably not a

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Sean Owen
Nice one, yeah indeed I was doing an incremental build. Not a blocker. I'll have a look into the others, though I suspect they're problems with tests rather than production code. On Tue, Jun 21, 2016 at 6:53 PM, Marcelo Vanzin wrote: > On Tue, Jun 21, 2016 at 10:49 AM, Sean

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Marcelo Vanzin
On Tue, Jun 21, 2016 at 10:49 AM, Sean Owen wrote: > I'm getting some errors building on Ubuntu 16 + Java 7. First is one > that may just be down to a Scala bug: > > [ERROR] bad symbolic reference. A signature in WebUI.class refers to > term eclipse > in package org which is

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Sean Owen
I'm getting some errors building on Ubuntu 16 + Java 7. First is one that may just be down to a Scala bug: [ERROR] bad symbolic reference. A signature in WebUI.class refers to term eclipse in package org which is not available. It may be completely missing from the current classpath, or the

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Shixiong(Ryan) Zhu
Hey Pete, I didn't backport it to 1.6 because it just affects tests in most cases. I'm sure we also have other places calling blocking methods in the event loops, so similar issues are still there even after applying this patch. Hence, I don't think it's a blocker for 1.6.2. On Tue, Jun 21, 2016

Jar for Spark developement

2016-06-21 Thread tesm...@gmail.com
Hi, I'm a beginner in Spark development. It took time to configure Eclipse + Scala. Is there any tutorial that can help beginners? I'm still struggling to find the Spark JAR files for development. There is no lib folder in my Spark distribution (neither in the pre-built nor in a custom-built one). Regards,
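
In case it helps: the pre-built jars live in different places across releases (lib/ in 1.x, jars/ in 2.x), but for development the usual route is not to point Eclipse at the distribution at all and instead declare Spark as a build dependency fetched from Maven Central. A minimal build.sbt sketch; the Spark and Scala versions are illustrative and should match whatever is actually targeted:

    // build.sbt -- a minimal sketch; version numbers are illustrative.
    name := "spark-hello"

    scalaVersion := "2.10.6"

    // Let sbt fetch the Spark jars from Maven Central instead of pointing
    // the IDE at jars inside the Spark distribution.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.6.1" % "provided"
    )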

Re: Possible contribution to MLlib

2016-06-21 Thread Jeff Zhang
I think it is valuable to make the distance function pluggable and also to provide some built-in distance functions. This might also be useful for other algorithms besides KMeans. On Tue, Jun 21, 2016 at 7:48 PM, Simon NANTY wrote: > Hi all, > > > > In my team, we are

Re: Structured Streaming partition logic with respect to storage and fileformat

2016-06-21 Thread Jörn Franke
Based on the underlying Hadoop FileFormat. This one does it mostly based on block size. You can change this, though. > On 21 Jun 2016, at 12:19, Sachin Aggarwal wrote: > > > When we use readStream to read data as a stream, how does Spark decide the > number of RDDs and
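
To make the "you can change this" part concrete, a small sketch of one knob that influences how file-based sources (whether on HDFS, S3 or the local file system) are split into partitions in Spark 2.0; the config name spark.sql.files.maxPartitionBytes and the need for an explicit schema on streaming file sources are stated as assumptions to verify against the version in use:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.StructType

    val spark = SparkSession.builder().appName("partition-sizing-sketch").getOrCreate()

    // File-based sources split input by size rather than one partition per
    // file; lowering the per-partition byte cap (here to 32 MB) yields more,
    // smaller partitions regardless of the underlying storage.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

    // The same splitting applies to the streaming file source; streaming
    // readers generally need an explicit schema. Path and fields are illustrative.
    val schema = new StructType().add("name", "string").add("age", "int")
    val dsJson = spark.readStream.schema(schema).json("/tmp/inputJson")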

Possible contribution to MLlib

2016-06-21 Thread Simon NANTY
Hi all, In my team, we are currently developing a fork of Spark MLlib that extends the K-means method so that it is possible to supply one's own distance function. In this implementation, it could be possible to directly pass, as an argument of the K-means train function, a distance function whose
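
For illustration only, one hypothetical shape such a pluggable distance function could take; the trait, objects and the train signature in the final comment are invented for this sketch and are not MLlib's existing API:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Hypothetical: a user-suppliable distance measure for K-means.
    trait DistanceMeasure extends Serializable {
      def distance(a: Vector, b: Vector): Double
    }

    object EuclideanDistance extends DistanceMeasure {
      def distance(a: Vector, b: Vector): Double = math.sqrt(Vectors.sqdist(a, b))
    }

    object ManhattanDistance extends DistanceMeasure {
      def distance(a: Vector, b: Vector): Double =
        a.toArray.zip(b.toArray).map { case (x, y) => math.abs(x - y) }.sum
    }

    // A caller would then pass the measure next to the usual parameters,
    // e.g. KMeans.train(data, k, maxIterations, measure = ManhattanDistance).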

Structured Streaming partition logic with respect to storage and fileformat

2016-06-21 Thread Sachin Aggarwal
When we use readStream to read data as a stream, how does Spark decide the number of RDDs and the partitions within each RDD with respect to storage and file format? val dsJson = sqlContext.readStream.json("/Users/sachin/testSpark/inputJson") val dsCsv = sqlContext.readStream.option("header","true").csv(

Re: [VOTE] Release Apache Spark 1.6.2 (RC2)

2016-06-21 Thread Pete Robbins
The PR (https://github.com/apache/spark/pull/13055) to fix https://issues.apache.org/jira/browse/SPARK-15262 was applied to 1.6.2; however, this fix caused another issue, https://issues.apache.org/jira/browse/SPARK-15606, the fix for which (https://github.com/apache/spark/pull/13355) has not been