Suggestion: RDD cache depth

2014-05-29 Thread innowireless TaeYun Kim
It would be nice if the RDD cache() method incorporated depth information. That is: void test() { JavaRDD<…> rdd = …; rdd.cache(); // to depth 1. Actual caching happens. rdd.cache(); // to depth 2. No-op as long as the storage level is the same; else, exception. … rdd.uncache(); // to
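A minimal sketch of the idea in Scala, using a hypothetical RefCountedCache helper (not part of Spark; names and behavior are assumptions for illustration). It counts cache requests so that the underlying RDD is persisted on the first acquire and only unpersisted when the last caller releases it:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Hypothetical helper illustrating the "cache depth" / reference-counting idea.
    class RefCountedCache[T](rdd: RDD[T], level: StorageLevel = StorageLevel.MEMORY_ONLY) {
      private var refCount = 0

      def acquire(): RDD[T] = synchronized {
        if (refCount == 0) rdd.persist(level)   // first caller triggers actual caching
        else require(rdd.getStorageLevel == level, "storage level mismatch")
        refCount += 1
        rdd
      }

      def release(): Unit = synchronized {
        refCount -= 1
        if (refCount == 0) rdd.unpersist()      // last caller triggers actual uncaching
      }
    }

With something like this, two libraries sharing the same RDD could each acquire() and release() it without one clobbering the other's cached data.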

GraphX triplets on 5-node graph

2014-05-29 Thread Michael Malak
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental? val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5"))) val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L,
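For reference, a self-contained reconstruction of the example in the Scala shell (assumes the REPL's SparkContext sc; the archived message is truncated, so the endpoints of the last edge are an assumption). On Spark 1.0, where the SPARK-1188 behavior mentioned in the reply below was removed, all vertices appear in the triplets:

    import org.apache.spark.graphx._

    val nodes = sc.parallelize(Array(
      (1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5")))
    // The fourth edge is cut off in the archive; 3L -> 5L here is an assumption.
    val edges = sc.parallelize(Array(
      Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
    val graph = Graph(nodes, edges)

    // Each triplet carries (srcAttr, attr, dstAttr); N2 and N4 show up here.
    graph.triplets.collect().foreach { t =>
      println(s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
    }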

Re: Suggestion: RDD cache depth

2014-05-29 Thread Matei Zaharia
This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it.

Re: GraphX triplets on 5-node graph

2014-05-29 Thread Reynold Xin
Take a look at this one: https://issues.apache.org/jira/browse/SPARK-1188 It was an optimization that added user inconvenience. We got rid of that now in Spark 1.0. On Wed, May 28, 2014 at 11:48 PM, Michael Malak michaelma...@yahoo.comwrote: Shouldn't I be seeing N2 and N4 in the output

RE: Suggestion: RDD cache depth

2014-05-29 Thread innowireless TaeYun Kim
Opened a JIRA issue. (https://issues.apache.org/jira/browse/SPARK-1962) Thanks. -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Thursday, May 29, 2014 3:54 PM To: dev@spark.apache.org Subject: Re: Suggestion: RDD cache depth This is a pretty cool idea -

Re: LogisticRegression: Predicting continuous outcomes

2014-05-29 Thread Bharath Ravi Kumar
Xiangrui, Christopher, Thanks for responding. I'll go through the code in detail to evaluate if the loss function used is suitable to our dataset. I'll also go through the referred paper since I was unaware of the underlying theory. Thanks again. -Bharath On Thu, May 29, 2014 at 8:16 AM,

Please change the instructions about Launching Applications Inside the Cluster

2014-05-29 Thread Lizhengbing (bing, BIPA)
The instructions are at http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster or http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster The original instruction is: ./bin/spark-class

Re: Standard preprocessing/scaling

2014-05-29 Thread dataginjaninja
I do see the issue with centering sparse data. Actually, the centering is less important than the scaling by the standard deviation; not having unit variance is what causes the convergence issues and long runtimes. Will RowMatrix compute the variance of a column?
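A short sketch of how the per-column variance can be read off in Spark 1.0 MLlib via RowMatrix, assuming an RDD of Vectors (the toy data and the SparkContext sc are assumptions; the actual scaling step is left out):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Toy data for illustration only.
    val data = sc.parallelize(Seq(
      Vectors.dense(1.0, 10.0),
      Vectors.dense(2.0, 20.0),
      Vectors.dense(3.0, 30.0)))

    val mat = new RowMatrix(data)
    val summary = mat.computeColumnSummaryStatistics()
    println(summary.variance)  // per-column variance, usable for scaling to unit variance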

Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Can anyone verify which RC includes [SPARK-1360] Add Timestamp Support for SQL #275 (https://github.com/apache/spark/pull/275)? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in pyspark. The error I get: 14/05/29 15:44:47
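A minimal reproduction sketch from the Scala shell; the table name "events" and its TIMESTAMP column "ts" are hypothetical, and hql() was the HiveContext query method in the 1.0 release candidates:

    import org.apache.spark.sql.hive.HiveContext

    // Assumes a running SparkContext `sc` (e.g. the shell's).
    val hc = new HiveContext(sc)

    // On an affected build, analyzing a query over a TIMESTAMP column triggers the error.
    val rows = hc.hql("SELECT ts FROM events LIMIT 10")
    rows.collect().foreach(println)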

Re: Timestamp support in v1.0

2014-05-29 Thread Andrew Ash
I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Patrick Wendell
+1 I spun up a few EC2 clusters and ran my normal audit checks. Tests passing; sigs, CHANGES and NOTICE look good. Thanks TD for helping cut this RC! On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote: +1 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 Ran current

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Thanks for reporting this! https://issues.apache.org/jira/browse/SPARK-1964 https://github.com/apache/spark/pull/913 If you could test out that PR and see if it fixes your problems I'd really appreciate it! Michael On Thu, May 29, 2014 at 9:09 AM, Andrew Ash and...@andrewash.com wrote: I

Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Yes, I get the same error: scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc) 14/05/29 16:53:40 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 14/05/29 16:53:40 INFO deprecation: mapred.max.split.size is

Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Michael, Will I have to rebuild after adding the change? Thanks

Re: Timestamp support in v1.0

2014-05-29 Thread dataginjaninja
Darn, I was hoping just to sneak it into that file. I am not the only person working on the cluster; if I rebuild, that means I have to redeploy everything to all the nodes as well. So I cannot do that ... today. If someone else doesn't beat me to it, I can rebuild at another time.

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
Yes, you'll need to download the code from that PR and reassemble Spark (sbt/sbt assembly). On Thu, May 29, 2014 at 10:02 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Michael, Will I have to rebuild after adding the change? Thanks

Re: Timestamp support in v1.0

2014-05-29 Thread Michael Armbrust
You should be able to get away with only doing it locally. This bug is happening during analysis which only occurs on the driver. On Thu, May 29, 2014 at 10:17 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Darn, I was hoping just to sneak it in that file. I am not the only person

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

2014-05-29 Thread Patrick Wendell
[tl;dr stable APIs are important - sorry, this is slightly meandering] Hey - just wanted to chime in on this as I was travelling. Sean, you bring up great points here about the velocity and stability of Spark. Many projects have fairly customized semantics around what versions actually mean

[RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Hello everyone, The vote on Spark 1.0.0 RC11 passes with 13 +1 votes, one 0 vote and no -1 votes. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng*

Re: [VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Tathagata Das
Let me put in my +1 as well! This vote is now closed, and it successfully passes with 13 +1 votes and one 0 vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Matei Zaharia
Yup, congrats all. The most impressive thing is the number of contributors to this release — with over 100 contributors, it’s becoming hard to even write the credits. Look forward to the Apache press release tomorrow. Matei On May 29, 2014, at 1:33 PM, Patrick Wendell pwend...@gmail.com wrote:

Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)

2014-05-29 Thread Andy Konwinski
Yes great work all. Special thanks to Patrick (and TD) for excellent leadership! On May 29, 2014 5:39 PM, Usman Ghani us...@platfora.com wrote: Congrats everyone. Really pumped about this. On Thu, May 29, 2014 at 2:57 PM, Henry Saputra henry.sapu...@gmail.com wrote: Congrats guys! Another

How does Spark partition data when creating a table with CREATE TABLE xxx AS SELECT * FROM xxx?

2014-05-29 Thread qingyang li
Hi spark developers, I am using Shark/Spark and am puzzled by the following question; I cannot find any info on the web, so I am asking here. 1. How does Spark partition data in memory when creating a table with CREATE TABLE a TBLPROPERTIES(shark.cache=memory) AS SELECT * FROM table b? In another
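This does not answer the Shark-specific part, but the general rule in core Spark is easy to check: caching does not repartition, so the in-memory blocks follow the partitions of the source RDD. A small inspection sketch (the file path and partition count are made-up values, and sc is assumed to be an existing SparkContext):

    val rdd = sc.textFile("hdfs:///data/table_b", 8)
    rdd.cache()
    rdd.count()  // materializes the cached partitions

    // One cached block per partition; the count matches the source partitioning.
    println(rdd.partitions.length)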