Re: Improving metadata in Spark JIRA
Open pull request count is down to 254 right now from ~325 several weeks ago. Open JIRA count is down slightly to 1262 from a peak over ~1320. That's despite an ever faster and larger stream of contributions. There's a real positive impact to JIRA being a little more meaningful: a little less backlog to keep looking at, commits getting in slightly faster, slightly happier contributors, etc. The virtuous circle can keep going.

It'd be great if every contributor could take a moment to look at his or her open PRs and JIRAs. Example searches (replace with your user name / name):

https://github.com/apache/spark/pulls/srowen
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20reporter%20%3D%20%22Sean%20Owen%22%20or%20assignee%20%3D%20%22Sean%20Owen%22

For PRs:

- If it appears to be waiting on your action or feedback:
  - push more changes and/or reply to comments, or
  - if it isn't work you can pursue in the immediate future, close the PR
- If it appears to be waiting on others:
  - if it's had feedback and it's unclear whether there's support to commit as-is:
    - break down or reduce the change to something less controversial, or
    - close the PR as softly rejected
  - if there's no feedback or it's plainly waiting for action, ping @them

For JIRAs:

- If it's fixed along the way, or obsolete, resolve as Fixed or NotAProblem
- Do a quick search to see whether a similar issue has been filed and is resolved or has more activity; resolve as Duplicate if so
- Check that fields are assigned reasonably:
  - Meaningful title and description
  - Reasonable type and priority. Not everything is a major bug, and few are blockers
  - 1+ Component
  - 1+ Affects Version
  - Avoid setting Target Version until it looks like there's momentum to merge a resolution
- If the JIRA has had no activity in a long time (6+ months), but does not feel obsolete, try to move it toward some resolution:
  - Request feedback, from specific people if desired, to feel out whether there is any other support for the change
  - Add more info, like a specific reproduction for bugs
  - Narrow the scope of feature requests to something that contains a few actionable steps, instead of broad open-ended wishes
  - Work on a fix. In an ideal world, people are willing to work to resolve JIRAs they open, and don't fire-and-forget

If everyone did this, not only would it advance the house-cleaning a bit more, but I'm sure we'd rediscover some important work and issues that need attention.

On Sun, Feb 22, 2015 at 7:54 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

> As of right now, there are no more open JIRA issues without an assigned component:
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20EMPTY%20ORDER%20BY%20updated%20DESC
> Hurray! Thanks to Sean and others for the cleanup!
> Nick
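The per-user JQL search above can also be scripted against JIRA's public REST search endpoint, which is handy for checking your own backlog periodically. A minimal sketch in Python; the `build_search_url` helper and the user name are illustrative, not part of any Spark tooling:

```python
# Sketch: build a JIRA REST search URL for one contributor's open SPARK issues.
# Fetching and parsing the JSON response is left out; this just builds the query.
import urllib.parse

JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

def build_search_url(full_name: str) -> str:
    """Return a REST search URL for unresolved issues reported by or
    assigned to the given user, most recently updated first."""
    jql = (
        f'project = SPARK AND resolution = Unresolved AND '
        f'(reporter = "{full_name}" OR assignee = "{full_name}") '
        f'ORDER BY updated DESC'
    )
    return JIRA_SEARCH + "?" + urllib.parse.urlencode({"jql": jql, "maxResults": 50})

url = build_search_url("Sean Owen")
print(url)
```

Pointing any HTTP client at the resulting URL returns the matching issues as JSON.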
Re: Improving metadata in Spark JIRA
> Open pull request count is down to 254 right now from ~325 several weeks ago.

This is great. Ideally, we'd get this down to 50 and keep it there. Having so many open pull requests is just a bad signal to contributors. But it will take some time to get there.

> - 1+ Component

Sean, do you have permission to edit our JIRA settings? It should be possible to enforce this in JIRA itself.

> - 1+ Affects version

I don't think this field makes sense for improvements, right?

Nick

On Sun Feb 22 2015 at 9:43:24 AM Sean Owen so...@cloudera.com wrote:

> Open pull request count is down to 254 right now from ~325 several weeks ago. [...]
Git Achievements
For fun, I just added Spark to this site: http://acha-acha.co/#/repo/https://github.com/apache/spark

Some of these "achievements" are hilarious:

- Leo Tolstoy: more than 10 lines in a commit message
- Dangerous Game: commit after 6 PM Friday

Nick
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the Building Spark page isn't looking too good to me. Working from f97b0d4, the example build command works:

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 -Phive-thriftserver -DskipTests clean package

...but then running the tests results in multiple failures in the Hive and Hive Thrift Server sub-projects.

On Wed, Feb 18, 2015 at 12:12 AM, Patrick Wendell pwend...@gmail.com wrote:

> Please vote on releasing the following candidate as Apache Spark version 1.3.0!
>
> The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1069/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0! The vote is open until Saturday, February 21, at 08:03 UTC and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by taking a Spark 1.2 workload, running it on this release candidate, and reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period, so -1 votes should only occur for significant regressions from 1.2.1. Bugs already present in 1.2.X, minor regressions, or bugs related to new features will not block this release.
>
> - Patrick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
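For anyone testing the release candidate, one basic check is verifying a downloaded artifact against the published digests alongside it. A generic sketch (the file names are placeholders, not the actual RC artifact names):

```python
# Sketch: verify a downloaded release artifact against a published SHA-512 digest.
# File names and the digest value are placeholders; adapt to the artifact you download.
import hashlib

def sha512_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-512 of a file, read in chunks so large artifacts fit in memory."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_published(path: str, published_hex: str) -> bool:
    """Compare the computed digest to the published one, ignoring case and whitespace."""
    return sha512_of(path) == published_hex.strip().lower()
```

Signature verification against the committer's GPG key is a separate step done with gpg itself.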
Re: Have Friedman's glmnet algo running in Spark
Hi Mike,

glmnet has definitely been very successful, and it would be great to see how we can improve optimization in MLlib! There is some related work ongoing; here are the JIRAs:

GLMNET implementation in Spark
https://issues.apache.org/jira/browse/SPARK-1673

LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
https://issues.apache.org/jira/browse/SPARK-5253

The GLMNET JIRA has actually been closed in favor of the latter JIRA. However, if you're getting good results in your experiments, could you please post them on the GLMNET JIRA and link them from the other JIRA? If it's faster and more scalable, that would be great to find out.

As far as where the code should go and the APIs, that can be discussed on the JIRA. I hope this helps, and I'll keep an eye out for updates on the JIRAs!

Joseph

On Thu, Feb 19, 2015 at 10:59 AM, m...@mbowles.com wrote:

> Dev List,
> A couple of colleagues and I have gotten several versions of the glmnet algorithm coded and running on Spark RDDs. The glmnet algorithm (http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for generating coefficient paths solving penalized regression with elastic net penalties. The algorithm runs fast by taking an approach that generates solutions for a wide range of penalty parameters.
> We're able to integrate into the MLlib class structure a couple of different ways. The algorithm may fit better into the new pipeline structure, since it naturally returns a multitude of models (corresponding to different values of the penalty parameters). That appears to fit the pipeline better than MLlib linear regression (for example).
> We've got regression running with the speed optimizations that Friedman recommends. We'll start working on the logistic regression version next. We're eager to make the code available as open source and would like to get some feedback about how best to do that. Any thoughts?
> Mike Bowles.
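For readers unfamiliar with the algorithm under discussion: the core of the glmnet approach (Friedman et al., linked above) is cyclic coordinate descent with soft-thresholding on the elastic net objective. Here is a small illustrative sketch in plain NumPy for a single (lambda, alpha) pair; it is not the Spark implementation Mike describes, and the function names are mine:

```python
# Sketch of the coordinate-descent update at the heart of the glmnet approach,
# minimizing (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2).
# Illustration only; the real algorithm also warm-starts over a lambda path.
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator used by the lasso / elastic-net update."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def elastic_net_cd(X, y, lam=0.1, alpha=0.5, n_iter=200):
    n, p = X.shape
    b = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n      # per-feature curvature (1 if standardized)
    r = y - X @ b                     # current residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]    # remove feature j's contribution
            rho = X[:, j] @ r / n     # partial correlation with the residual
            b[j] = soft_threshold(rho, lam * alpha) / (z[j] + lam * (1 - alpha))
            r = r - X[:, j] * b[j]    # add the updated contribution back
    return b
```

With lam=0 this reduces to ordinary least squares; the speed of the full method comes from warm-starting these updates across a decreasing sequence of lambdas, which is what yields the coefficient path cheaply.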
Re: Spark SQL - Long running job
How about persisting the computed result table to disk before caching it? That way, after restarting your service, you only need to cache the persisted result table instead of recomputing it. Somewhat like checkpointing.

Cheng

On 2/22/15 12:55 AM, nitin wrote:

> Hi All,
> I intend to build a long-running Spark application which fetches data/tuples from Parquet, does some time-consuming processing, and then caches the processed table (InMemoryColumnarTableScan). My use case is good retrieval time for SQL queries (benefiting from the Spark SQL optimizer) and data compression (built into in-memory caching).
> Now the problem is that if my driver goes down, I will have to fetch the data again for all the tables, recompute, and re-cache, which is time-consuming. Is it possible to persist processed/cached RDDs on disk so that my system's downtime is shorter when restarted after failure?
> On a side note, the data processing contains a shuffle step which creates huge temporary shuffle files on local disk in the temp folder, and as per current logic, shuffle files don't get deleted for running executors. This is causing my local disk to fill up quickly and run out of space, as it's a long-running Spark job. (Running Spark in yarn-client mode, btw.)
> Thanks
> -Nitin
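The persist-then-cache pattern Cheng suggests can be seen in miniature in plain Python, with pickle standing in for writing the computed table to Parquet/HDFS. The `get_result` helper and `compute_result` callback are hypothetical names for illustration, not Spark APIs:

```python
# Sketch of the persist-then-cache pattern: compute once, persist to disk,
# and on restart reload the persisted result instead of recomputing it.
# pickle stands in for saving the result table to Parquet; compute_result
# stands in for the expensive processing step.
import os
import pickle

def get_result(path, compute_result):
    """Load a previously persisted result if present; otherwise compute it
    once and persist it. After a restart, only (re)caching remains."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)     # fast path after a driver restart
    result = compute_result()         # slow path: first run only
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

In the Spark SQL setting this corresponds to saving the processed table once, and on restart loading and caching the saved table rather than redoing the shuffle-heavy processing.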
Re: textFile() ordering and header rows
I guess on a technicality the docs just say "first item in this RDD", not "first line in the source text file". AFAIK there is no way, apart from filtering, to remove header lines: http://stackoverflow.com/a/24734612/877069

As long as first() always returns the same value for a given RDD, I think it's fine, no?

Nick

On Sun Feb 22 2015 at 9:09:01 PM Michael Malak michaelma...@yahoo.com.invalid wrote:

> Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?
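The filtering approach in the linked Stack Overflow answer drops the header inside the first partition via mapPartitionsWithIndex. The per-partition logic can be simulated in plain Python, with lists standing in for an RDD's partitions (the `drop_header` helper is illustrative, not a Spark API):

```python
# Simulation of the header-dropping trick: since the header line lives in
# partition 0, drop the first element of that partition only. In Spark this is
# rdd.mapPartitionsWithIndex(lambda i, it: islice(it, 1, None) if i == 0 else it).
def drop_header(partitions):
    """Remove the first line of partition 0; leave other partitions untouched."""
    out = []
    for i, part in enumerate(partitions):
        out.append(part[1:] if i == 0 else part)
    return out
```

This sidesteps relying on any global ordering: only the fact that the header was read into the first partition is used.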
textFile() ordering and header rows
Since RDDs are generally unordered, aren't things like textFile().first() not guaranteed to return the first row (such as looking for a header row)? If so, doesn't that make the example in http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?