Re: Improving metadata in Spark JIRA

2015-02-22 Thread Sean Owen
Open pull request count is down to 254 right now from ~325 several weeks ago. Open JIRA count is down slightly to 1262 from a peak over ~1320. Obviously, this is in the face of an ever faster and larger stream of contributions. There's a real positive impact of JIRA being a little more meaningful, a

Re: Improving metadata in Spark JIRA

2015-02-22 Thread Nicholas Chammas
Open pull request count is down to 254 right now from ~325 several weeks ago. This is great. Ideally, we need to get this down to 50 and keep it there. Having so many open pull requests is just a bad signal to contributors. But it will take some time to get there. - 1+ Component Sean, do you

Git Achievements

2015-02-22 Thread Nicholas Chammas
For fun: http://acha-acha.co/#/repo/https://github.com/apache/spark I just added Spark to this site. Some of these “achievements” are hilarious. Leo Tolstoy: More than 10 lines in a commit message Dangerous Game: Commit after 6PM friday Nick

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-22 Thread Mark Hamstra
So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the Building Spark page isn't looking too good to me. Working from f97b0d4, the example build command works: mvn -Pyarn

Re: Have Friedman's glmnet algo running in Spark

2015-02-22 Thread Joseph Bradley
Hi Mike, glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs: GLMNET implementation in Spark https://issues.apache.org/jira/browse/SPARK-1673 LinearRegression with L1/L2

Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table first before caching it? So that you only need to cache the result table after restarting your service without recomputing it. Somewhat like checkpointing. Cheng On 2/22/15 12:55 AM, nitin wrote: Hi All, I intend to build a long running spark

Re: textFile() ordering and header rows

2015-02-22 Thread Nicholas Chammas
I guess on a technicality the docs just say first item in this RDD, not first line in the source text file. AFAIK there is no way apart from filtering to remove header lines http://stackoverflow.com/a/24734612/877069. As long as first() always returns the same value for a given RDD, I think it's
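
For readers following the thread, the filtering approach mentioned above might look roughly like the sketch below. It is only an illustration: a plain Python list stands in for the RDD so the snippet runs without Spark, and the sample data and the helper name without_header are hypothetical. With PySpark, the corresponding calls would presumably be header = rdd.first() followed by rdd.filter(lambda line: line != header), which depends on first() actually returning the header row, the very point under discussion in this thread.

```python
def without_header(lines, header):
    """Yield every line that is not equal to the header line.

    Caveat: any data line identical to the header is dropped as well.
    """
    return (line for line in lines if line != header)


# Hypothetical sample data; a list stands in for the RDD of text lines.
lines = ["name,age", "alice,30", "bob,25"]

# Stand-in for rdd.first(); assumes the first element is the header row.
header = lines[0]

rows = list(without_header(lines, header))
```

Filtering by value rather than by position means the result does not depend on how the remaining lines are ordered, only on the header having been identified correctly.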

textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?