Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Hari Shreedharan
Ah, ok. It was missing in the list of JIRAs. So +1. Thanks, Hari On Mon, Apr 6, 2015 at 11:36 AM, Patrick Wendell pwend...@gmail.com wrote: I believe TD just forgot to set the fix version on the JIRA. There is a fix for this in 1.3:

Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
Cc'ing Chris Fregly, who wrote the Kinesis integration. Maybe he can help. On Mon, Apr 6, 2015 at 9:23 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Hi all, I am wondering, has anyone on this list been able to successfully implement Spark on top of Kinesis? Best, Vadim On
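
For anyone hitting the same wall: a minimal sketch of wiring up the Kinesis receiver, assuming the spark-streaming-kinesis-asl module of this era and AWS credentials already configured; the stream name and endpoint URL are made up, and the createStream signature should be checked against your Spark version.

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

    object KinesisSketch {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("KinesisSketch"), Seconds(10))
        // One receiver per shard is the usual guidance; a single stream here for brevity.
        val records = KinesisUtils.createStream(
          ssc, "myStreamName", "https://kinesis.us-east-1.amazonaws.com",
          Seconds(10), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
        records.map(bytes => new String(bytes, "UTF-8")).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }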

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Mark Hamstra
+1 On Sat, Apr 4, 2015 at 5:09 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Krishna Sankar
+1 On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.2! The tag to be voted on is v1.2.2-rc1 (commit 7531b50):

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Patrick Wendell
I believe TD just forgot to set the fix version on the JIRA. There is a fix for this in 1.3: https://github.com/apache/spark/commit/03e263f5b527cf574f4ffcd5cd886f7723e3756e - Patrick On Mon, Apr 6, 2015 at 2:31 PM, Mark Hamstra m...@clearstorydata.com wrote: Is that correct, or is the JIRA

Re: [VOTE] Release Apache Spark 1.2.2

2015-04-06 Thread Reynold Xin
+1 too On Sun, Apr 5, 2015 at 4:24 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.2.2! The tag to be voted on is v1.2.2-rc1 (commit 7531b50):

Re: Stochastic gradient descent performance

2015-04-06 Thread Reynold Xin
Note that we can do this in DataFrames and use Catalyst to push Sample down beneath Projection :) On Mon, Apr 6, 2015 at 12:42 PM, Xiangrui Meng men...@gmail.com wrote: The gap sampling is triggered when the sampling probability is small and the direct underlying storage has constant time
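
To make the optimization concrete, here is a hedged sketch against the 1.3 DataFrame API; the "users" table and its columns are invented. Written this way, Sample sits above Project in the logical plan, and a Catalyst rule can swap them so the projection is evaluated on roughly 1% of the rows instead of all of them.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    import sqlContext.implicits._

    val projectedThenSampled = sqlContext.table("users")
      .select(($"age" + 1).as("agePlusOne"))
      .sample(withReplacement = false, fraction = 0.01)
    // Print both the logical and optimized plans to see where Sample ends up.
    projectedThenSampled.explain(true)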

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Sean McNamara
+1 On Apr 4, 2015, at 6:11 PM, Patrick Wendell pwend...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.3.1! The tag to be voted on is v1.3.1-rc1 (commit 0dcb5d9f):

Re: Stochastic gradient descent performance

2015-04-06 Thread Xiangrui Meng
The gap sampling is triggered when the sampling probability is small and the direct underlying storage has constant-time lookups, in particular, ArrayBuffer. This is a very strict requirement. If the rdd is cached in memory, we use ArrayBuffer to store its elements, and rdd.sample will trigger gap
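
A small sketch of the condition described here, using the 1.3 RDD API; whether gap sampling actually kicks in depends on how the cached partition is stored, so treat this as illustrative rather than a guarantee.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("GapSamplingSketch"))
    val rdd = sc.parallelize(1 to 10000000).cache()
    rdd.count() // materialize the cache so elements sit in in-memory buffers

    // With a small fraction and no replacement, the sampler can skip ahead by
    // random gaps instead of testing each element -- which is only cheap when
    // the underlying storage has constant-time lookups.
    val sampled = rdd.sample(withReplacement = false, fraction = 0.001)
    sampled.count()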

Re: Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng
This is being discussed in https://issues.apache.org/jira/browse/SPARK-6407. Let's move the discussion there. Thanks for providing references! -Xiangrui On Sun, Apr 5, 2015 at 11:48 PM, Chunnan Yao yaochun...@gmail.com wrote: On-line Collaborative Filtering(CF) has been widely used and studied.

Re: Experience using binary packages on various Hadoop distros

2015-04-06 Thread Dean Chen
This would be great for those of us running on HDP. At eBay we recently ran into a few problems using the generic Hadoop lib. Two off the top of my head: * Needed to include our custom Hadoop client due to custom Kerberos integration * A minor difference in HDFS protocol causing the following

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
Similar problem on 1.2 branch: [ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-network-common_2.10:jar:1.2.3-SNAPSHOT,

Re: Zinc now required?

2015-04-06 Thread Sean Owen
I don't think it's required. This looks like zinc is running (it seems to find the process on port 3030), but something is wrong with zinc then. If you aren't running your own zinc, then it's the copy downloaded by Spark. Maybe try deleting that and shutting down the zinc process, and trying a

[mllib] Deprecate static train and use builder instead for Scala/Java

2015-04-06 Thread Yu Ishikawa
Hi all, Joseph proposed an idea about using just builder methods, instead of static train() methods, for Scala/Java. I agree with that idea, because we have many duplicated static train() methods. If you have any thoughts on that, please share them with us. [SPARK-6682] Deprecate static train and
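
For concreteness, mllib's KMeans already exposes both styles, which shows the duplication being discussed; a sketch, not a final API proposal:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // sc: an existing SparkContext
    val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))

    // Static train(): each new option tends to force another overload.
    val modelA = KMeans.train(data, 2, 20) // k = 2, maxIterations = 20

    // Builder style: one entry point, options added via setters.
    val modelB = new KMeans().setK(2).setMaxIterations(20).run(data)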

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Marty Bower
I'm killing zinc (if it's running) before running each build attempt. Trying to build as clean as possible. On Mon, Apr 6, 2015 at 7:31 PM Patrick Wendell pwend...@gmail.com wrote: What if you don't run zinc? I.e. just download maven and run that mvn package It might take longer, but I

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The issue is that if you invoke build/mvn, it will start zinc again if it sees that it has been killed. The absolute most sterile thing to do is this: 1. Kill any zinc processes. 2. Clean up spark: git clean -fdx (WARNING: this will delete any staged changes you have, if you have code modifications or

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
One thing that I think can cause issues is if you run build/mvn with Scala 2.10, then try to run it with 2.11, since I think we may store some downloaded jars relating to zinc that will get screwed up. Not sure that's what is happening, just an idea. On Mon, Apr 6, 2015 at 10:54 PM, Patrick

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
I resorted to deleting the spark directory between each build earlier today (attempting maximum sterility) and then re-cloning from github and switching to the 1.2 or 1.3 branch. Does anything persist outside of the spark directory? Are you able to build either 1.2 or 1.3 w/ Scala-2.11?

1.3 Build Error with Scala-2.11

2015-04-06 Thread mjhb
$ dev/change-version-to-2.11.sh
$ build/mvn -e -DskipTests clean package
[ERROR] Failed to execute goal on project spark-core_2.11: Could not resolve dependencies for project org.apache.spark:spark-core_2.11:jar:1.3.2-SNAPSHOT: The following artifacts could not be resolved:

Re: Zinc now required?

2015-04-06 Thread mjhb
Killing zinc resolved the problem building with scala-2.10 - thank you. (adding that to my build script) Having problems building with scala-2.11 - will post separately for that if reproducible.

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
What if you don't run zinc? I.e., just download maven and run that: mvn package. It might take longer, but I wonder if it will work. On Mon, Apr 6, 2015 at 10:26 PM, mjhb sp...@mjhb.com wrote: Similar problem on 1.2 branch: [ERROR] Failed to execute goal on project spark-core_2.11: Could not

Zinc now required?

2015-04-06 Thread mjhb
Today I cannot build the 1.2 branch: [INFO] Building Spark Project Networking 1.2.3-SNAPSHOT [INFO] (remainder of build log snipped)

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
The only thing that can persist outside of Spark is a live Zinc process. We took care to make sure this was a generally stateless mechanism. Both the 1.2.X and 1.3.X releases are built with Scala 2.11 for packaging purposes. And these have been built as recently as in the last

Re: 1.3 Build Error with Scala-2.11

2015-04-06 Thread Patrick Wendell
Hmm, make sure you are building with the right flags. I think you need to pass -Dscala-2.11 to maven. Take a look at the upstream docs - on my phone now so I can't easily access them. On Apr 7, 2015 1:01 AM, mjhb sp...@mjhb.com wrote: I even deleted my local maven repository (.m2) but still stuck

Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Chunnan Yao
Online Collaborative Filtering (CF) has been widely used and studied. Retraining a CF model from scratch every time new data comes in is very inefficient (http://stackoverflow.com/questions/27734329/apache-spark-incremental-training-of-als-model). However, in the Spark community we see few
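
One lightweight flavor of this, short of fully online learning, is folding a new user in against fixed item factors; below is a minimal breeze sketch of that regularized least-squares step (all names illustrative, not an existing Spark API).

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Fold-in: given fixed item factors Y (numItems x rank) and a new user's
    // ratings r, solve (Y^T Y + lambda * I) x = Y^T r for the user's vector x.
    def foldInUser(Y: DenseMatrix[Double], r: DenseVector[Double],
                   lambda: Double): DenseVector[Double] = {
      val gram = Y.t * Y + DenseMatrix.eye[Double](Y.cols) * lambda
      gram \ (Y.t * r)
    }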

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-04-06 Thread Reynold Xin
I think those are great to have. I would put them in the DataFrame API though, since this applies to structured data. Many of the advanced functions on PairRDDFunctions should really go into the DataFrame API now that we have it. One thing that would be great to understand is what

Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-04-06 Thread Grega Kešpret
Hi! I'd like to get the community's opinion on implementing a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I would find it useful and I haven't found any existing implementation. The plan was basically to wrap t-digest
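
For reference, the rough shape of that wrapping: build one digest per partition, then merge. The TDigest calls below (createAvlTreeDigest, add, quantile) follow Ted Dunning's library from memory and assume the digest is serializable, so verify against the version you actually depend on.

    import com.tdunning.math.stats.TDigest
    import org.apache.spark.rdd.RDD

    // One pass over the data, bounded memory per partition.
    def approxQuantiles(data: RDD[Double], quantiles: Seq[Double],
                        compression: Double = 100.0): Seq[Double] = {
      val digest = data.treeAggregate(TDigest.createAvlTreeDigest(compression))(
        seqOp = (d, x) => { d.add(x); d },
        combOp = (d1, d2) => { d1.add(d2); d1 })
      quantiles.map(q => digest.quantile(q))
    }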

Re: Wrong initial bias in GraphX SVDPlusPlus?

2015-04-06 Thread Sean Owen
See now: https://issues.apache.org/jira/browse/SPARK-6710 On Mon, Apr 6, 2015 at 4:27 AM, Reynold Xin r...@databricks.com wrote: Adding Jianping Wang to the thread, since he contributed the SVDPlusPlus implementation. Jianping, Can you take a look at this message? Thanks. On Fri, Apr 3,

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-06 Thread Sean Owen
SPARK-6673 is not, in the end, relevant for 1.3.x, I believe; we just resolved it for 1.4 anyway. False alarm there. I back-ported SPARK-6205 into the 1.3 branch for next time. We'll pick it up if there's another RC, but by itself it is not something that needs a new RC. (I will give the same