Re: Improving metadata in Spark JIRA

2015-02-22 Thread Sean Owen
Open pull request count is down to 254 right now from ~325 several weeks
ago.
Open JIRA count is down slightly to 1262 from a peak of ~1320.
And this is in the face of an ever faster and larger stream of contributions.

There's a real positive impact: JIRA is a little more meaningful, there's a
little less backlog to keep looking at, commits land slightly faster,
contributors are slightly happier, etc.


The virtuous circle can keep going. It'd be great if every contributor
could take a moment to look at his or her open PRs and JIRAs. Example
searches (replace with your user name / name):

https://github.com/apache/spark/pulls/srowen
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20reporter%20%3D%20%22Sean%20Owen%22%20or%20assignee%20%3D%20%22Sean%20Owen%22
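
For reference, the JQL behind that second link decodes to (substitute your
own name):

  project = SPARK AND reporter = "Sean Owen" or assignee = "Sean Owen"

Note that JQL gives AND higher precedence than OR, so as written this also
matches issues assigned to you outside SPARK; parenthesizing the reporter /
assignee clause scopes it to SPARK only.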

For PRs:

- if it appears to be waiting on your action or feedback,
  - push more changes and/or reply to comments, or
  - if it isn't work you can pursue in the immediate future, close the PR

- if it appears to be waiting on others,
  - if it's had feedback and it's unclear whether there's support to commit
    it as-is,
    - break down or reduce the change to something less controversial, or
    - close the PR as softly rejected
  - if there's no feedback or it's plainly waiting for action, ping @them

For JIRAs:

- If it's fixed along the way, or obsolete, resolve as Fixed or NotAProblem

- Do a quick search to see if a similar issue has been filed and is
resolved or has more activity; resolve as Duplicate if so

- Check that fields are assigned reasonably:
  - Meaningful title and description
  - Reasonable type and priority. Not everything is a major bug, and few
are blockers
  - 1+ Component
  - 1+ Affects version
  - Avoid setting target version until it looks like there's momentum to
merge a resolution

- If the JIRA has had no activity in a long time (6+ months), but does not
feel obsolete, try to move it to some resolution (a JQL sketch for finding
these follows this list):
  - Request feedback, from specific people if desired, to feel out if there
is any other support for the change
  - Add more info, like a specific reproduction for bugs
  - Narrow scope of feature requests to something that contains a few
actionable steps, instead of broad open-ended wishes
  - Work on a fix. In an ideal world people are willing to work to resolve
JIRAs they open, and don't fire-and-forget
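
To surface JIRAs matching that last bullet, a minimal JQL sketch (the
26-week cutoff is a stand-in for "6+ months"):

  project = SPARK AND resolution = Unresolved AND updated <= "-26w"
  ORDER BY updated ASC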


If everyone did this, not only would it advance the house-cleaning a bit
more, but I'm sure we'd rediscover some important work and issues that need
attention.


On Sun, Feb 22, 2015 at 7:54 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 As of right now, there are no more open JIRA issues without an assigned
 component:
 https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20EMPTY%20ORDER%20BY%20updated%20DESC
 Hurray!

 [image: yay]

 Thanks to Sean and others for the cleanup!

 Nick



Re: Improving metadata in Spark JIRA

2015-02-22 Thread Nicholas Chammas
 Open pull request count is down to 254 right now from ~325 several weeks
 ago.

This is great. Ideally, we need to get this down to < 50 and keep it there.
Having so many open pull requests is just a bad signal to contributors. But
it will take some time to get there.


   - 1+ Component

 Sean, do you have permission to edit our JIRA settings? It should be
possible to enforce this in JIRA itself.


   - 1+ Affects version

 I don’t think this field makes sense for improvements, right?

Nick

On Sun Feb 22 2015 at 9:43:24 AM Sean Owen so...@cloudera.com wrote:

 [snip]


Git Achievements

2015-02-22 Thread Nicholas Chammas
For fun:

http://acha-acha.co/#/repo/https://github.com/apache/spark

I just added Spark to this site. Some of these “achievements” are hilarious.

Leo Tolstoy: More than 10 lines in a commit message

Dangerous Game: Commit after 6PM friday

Nick


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-22 Thread Mark Hamstra
So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
every combination of Hadoop and Hive versions, etc., can be supported, but
even an example build from the Building Spark page isn't looking too good
to me.

Working from f97b0d4, the example build command works:

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 \
    -Phive-thriftserver -DskipTests clean package

...but then running the tests results in multiple failures in the Hive and
Hive Thrift Server sub-projects.
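
If it helps anyone reproduce, a sketch for rerunning only the failing
sub-projects after a full mvn install -DskipTests (the module paths are my
assumption from the 1.3 source tree):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 \
    -Phive-thriftserver -pl sql/hive,sql/hive-thriftserver test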


On Wed, Feb 18, 2015 at 12:12 AM, Patrick Wendell pwend...@gmail.com
wrote:

 Please vote on releasing the following candidate as Apache Spark version
 1.3.0!

 The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):

 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1069/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.0!

 The vote is open until Saturday, February 21, at 08:03 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.
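
 A minimal sketch of checking the signatures above with standard GPG
 commands (the artifact name is a placeholder for whichever file you
 download):

   wget https://people.apache.org/keys/committer/pwendell.asc
   gpg --import pwendell.asc
   gpg --verify spark-1.3.0.tgz.asc spark-1.3.0.tgz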

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.

 - Patrick





Re: Have Friedman's glmnet algo running in Spark

2015-02-22 Thread Joseph Bradley
Hi Mike,

glmnet has definitely been very successful, and it would be great to see
how we can improve optimization in MLlib!  There is some related work
ongoing; here are the JIRAs:

GLMNET implementation in Spark
https://issues.apache.org/jira/browse/SPARK-1673

LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
https://issues.apache.org/jira/browse/SPARK-5253
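
For context, the elastic-net objective both of those JIRAs target, in the
notation of the glmnet paper (alpha blends the L1 and L2 penalties):

  \min_{\beta_0,\beta} \frac{1}{2N} \sum_{i=1}^{N} (y_i - \beta_0 - x_i^T \beta)^2
    + \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right]

alpha = 1 is the lasso, alpha = 0 is ridge, and glmnet computes the whole
coefficient path over a decreasing sequence of lambda values.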

The GLMNET JIRA has actually been closed in favor of the latter JIRA.
However, if you're getting good results in your experiments, could you
please post them on the GLMNET JIRA and link them from the other JIRA?  If
it's faster and more scalable, that would be great to find out.

As far as where the code should go and the APIs, that can be discussed on
the JIRA.

I hope this helps, and I'll keep an eye out for updates on the JIRAs!

Joseph


On Thu, Feb 19, 2015 at 10:59 AM, m...@mbowles.com wrote:

 Dev List,
 A couple of colleagues and I have gotten several versions of glmnet algo
 coded and running on Spark RDDs. The glmnet algo (
 http://www.jstatsoft.org/v33/i01/paper) is a very fast algorithm for
 generating coefficient paths solving penalized regression with elastic net
 penalties. The algorithm runs fast by taking an approach that generates
 solutions for a wide range of penalty parameter values. We're able to
 integrate it into the MLlib class structure in a couple of different ways.
 The algorithm may fit better into the new pipeline structure since it
 naturally returns a multitude of models (corresponding to different values
 of the penalty parameters). That appears to fit the pipeline better than
 MLlib linear regression does (for example).

 We've got regression running with the speed optimizations that Friedman
 recommends. We'll start working on the logistic regression version next.

 We're eager to make the code available as open source and would like to
 get some feedback about how best to do that. Any thoughts?
 Mike Bowles.





Re: Spark SQL - Long running job

2015-02-22 Thread Cheng Lian
How about persisting the computed result table to disk before caching it?
That way, after restarting your service, you only need to re-cache the
result table rather than recompute it. Somewhat like checkpointing.
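
A minimal sketch of that idea against the 1.2-era SQL API (paths and names
here are made up; resultTable is the SchemaRDD holding the computed result):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // After the expensive computation, write the result out once.
  resultTable.saveAsParquetFile("hdfs:///app/result.parquet")

  // On restart, reload from Parquet and cache it, skipping recomputation.
  val restored = sqlContext.parquetFile("hdfs:///app/result.parquet")
  restored.registerTempTable("result")
  sqlContext.cacheTable("result")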


Cheng

On 2/22/15 12:55 AM, nitin wrote:

Hi All,

I intend to build a long-running Spark application which fetches data/tuples
from Parquet, does some time-consuming processing, and then caches the
processed table (InMemoryColumnarTableScan). My use case calls for good
retrieval time for SQL queries (the benefit of the Spark SQL optimizer) and
data compression (built into in-memory caching). Now the problem is that if
my driver goes down, I have to fetch the data again for all the tables,
recompute it, and re-cache it, which is time consuming.

Is it possible to persist processed/cached RDDs on disk, so that my system
comes back up faster when restarted after a failure or going down?

On a side note, the data processing contains a shuffle step which creates
huge temporary shuffle files on local disk in the temp folder, and as per
the current logic, shuffle files don't get deleted for running executors.
This is quickly filling up my local disk and running it out of space, since
it's a long-running Spark job. (I'm running Spark in yarn-client mode, by
the way.)

Thanks
-Nitin











Re: textFile() ordering and header rows

2015-02-22 Thread Nicholas Chammas
I guess on a technicality the docs just say "first item in this RDD", not
"first line in the source text file". AFAIK there is no way, apart from
filtering, to remove header lines:
http://stackoverflow.com/a/24734612/877069
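
A sketch of that filtering idiom, along the lines of the Stack Overflow
answer (the file name is made up):

  val lines = sc.textFile("data.csv")
  // Drop the first line of partition 0; for a single text file,
  // that is the header row.
  val data = lines.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter
  }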

As long as first() always returns the same value for a given RDD, I think
it's fine, no?

Nick


On Sun Feb 22 2015 at 9:09:01 PM Michael Malak
michaelma...@yahoo.com.invalid wrote:

 Since RDDs are generally unordered, aren't things like textFile().first()
 not guaranteed to return the first row (such as looking for a header row)?
 If so, doesn't that make the example in
 http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?





textFile() ordering and header rows

2015-02-22 Thread Michael Malak
Since RDDs are generally unordered, aren't things like textFile().first() not 
guaranteed to return the first row (such as looking for a header row)? If so, 
doesn't that make the example in 
http://spark.apache.org/docs/1.2.1/quick-start.html#basics misleading?
