Re: Welcoming two new committers

2016-02-08 Thread Corey Nolet
Congrats guys!

On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu  wrote:

> Congratulations, Herman and Wenchen.
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van Hovell
>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>> adding new features, optimizations and APIs. Please join me in welcoming
>> Herman and Wenchen.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David,

Thank you very much for announcing this! It looks like it could be very
useful. Would you mind providing a link to the github?

On Tue, Jan 12, 2016 at 10:03 AM, David 
wrote:

> Hi all,
>
> I'd like to share news of the recent release of a new Spark package, ROSE.
>
>
> ROSE is a Scala library offering access to the full scientific computing
> power of the R programming language to Apache Spark batch and streaming
> applications on the JVM. Where Apache SparkR lets data scientists use Spark
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
>
> The project is available and documented on GitHub and I would encourage
> you to take a look. Any feedback, questions, etc. are very welcome.
>
> David
>
> "All that is gold does not glitter, Not all those who wander are lost."
>


Re: Forecasting Library For Apache Spark

2015-09-21 Thread Corey Nolet
Mohamed,

Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
was added to this recently and seasonal ARIMA should be following shortly.

[1] https://github.com/cloudera/spark-timeseries

On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar 
wrote:

> Hello everybody, this is my first mail on the list, so I would like to
> introduce myself first :)
> My name is Mohamed Baddar. I work as a Big Data and Analytics Software
> Engineer at BADRIT (http://badrit.com/), a software startup with a focus on
> Big Data. I have also been working for 6+ years at IBM R&D Egypt, in the
> HPC, Big Data, and Analytics areas.
>
> I just have a question: I can't find a supported Apache Spark library for
> forecasting using ARIMA, ETS, Bayesian models, or any other method. Are
> there any plans for such a development? I can't find any issue talking
> about it. Is anyone interested in having/developing a related module? I
> find it a critical feature to be added to Spark.
>
> Thanks
>


Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client
API, so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on the nodes containing the
query results, which means you have to assume most results will be sent over
the network to the tasks that need to process them. This is bad. The other
reason (also related to locality) is that I'm not sure there's an easy way to
spread the results of a query over multiple different clients, so you'd
probably have to start your Spark RDD with a single partition and then
repartition. At that point you've taken data from multiple MongoDB nodes,
collected it on a single node, and then re-partitioned it, again across the
network, onto multiple nodes. This is also bad.

I think this is the reason it was recommended to use MongoDB's map-reduce:
it can use its locality information internally. I had this same issue with
Couchbase a couple of years back; it's unfortunate, but it's the reality.
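To make that concrete, here's a rough sketch (not from this thread) of the
load-then-repartition pattern, assuming the mongo-hadoop connector's
MongoInputFormat; the URI and partition count are placeholders. However the
initial partitions fall, the repartition still shuffles the query results
across the network, which is exactly the locality problem described above:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.SparkContext
  import org.bson.BSONObject
  import com.mongodb.hadoop.MongoInputFormat

  val sc = new SparkContext("local[2]", "mongo-locality-sketch")

  // placeholder URI for the collection being queried
  val conf = new Configuration()
  conf.set("mongo.input.uri", "mongodb://host:27017/mydb.mycollection")

  // (ObjectId, document) pairs read through the Hadoop input format
  val mongoRDD = sc.newAPIHadoopRDD(conf, classOf[MongoInputFormat],
    classOf[Object], classOf[BSONObject])

  // spreading the work afterwards means shuffling the same data over the network again
  val spread = mongoRDD.repartition(32)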




On Fri, Sep 11, 2015 at 9:34 AM, Sandeep Giri 
wrote:

> I think it should be possible by loading the collections as RDDs and then
> doing a union on them.
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com. 
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <
> abhishek.mis...@xerox.com> wrote:
>
>> Anything using Spark RDD’s ???
>>
>>
>>
>> Abhishek
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
>> *Sent:* Friday, September 11, 2015 3:19 PM
>> *To:* Mishra, Abhishek; u...@spark.apache.org; dev@spark.apache.org
>> *Subject:* Re: MongoDB and Spark
>>
>>
>>
>> use map-reduce.
>>
>>
>>
>> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
>> wrote:
>>
>> Hello ,
>>
>>
>>
>> Is there any way to query multiple collections from MongoDB using Spark
>> and Java? I want to create only one Configuration object. Please help
>> if anyone has something regarding this.
>>
>>
>>
>>
>>
>> Thank You
>>
>> Abhishek
>>
>>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Corey Nolet
+1 (non-binding)

- Verified signatures
- Built on Mac OS X and Fedora 21.

On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar ksanka...@gmail.com wrote:

 Excellent, Thanks Xiangrui. The mystery is solved.
 Cheers
 k/


 On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng men...@gmail.com wrote:

  Krishna, I tested your linear regression example. For linear
  regression, we changed its objective function from (1/n) * ||Ax - b||_2^2
  to (1/(2n)) * ||Ax - b||_2^2 to be consistent with common least-squares
  formulations. It means you could reproduce the same result by
  multiplying the step size by 2. This is not a problem if both runs go
  until convergence (and don't blow up). However, in your example, a very
  small step size is chosen and it didn't converge in 100 iterations. In
  this case, the step size matters. I will put a note in the migration
  guide. Thanks! -Xiangrui
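  (As a rough user-side illustration of that point, not taken from the
  thread itself and with made-up data, step size, and iteration count,
  doubling the old step size when calling MLlib's LinearRegressionWithSGD
  would look something like this:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val sc = new SparkContext("local[2]", "stepsize-sketch")

    // toy data, purely illustrative: y = 2 * x
    val data = sc.parallelize((1 to 100).map { i =>
      LabeledPoint(2.0 * i, Vectors.dense(i.toDouble))
    }).cache()

    val numIterations = 100
    val oldStepSize = 0.0001  // whatever was tuned against the old 1/n objective

    // under the new 1/(2n) objective, doubling the step size reproduces the old behaviour
    val model = LinearRegressionWithSGD.train(data, numIterations, 2 * oldStepSize)
  )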
 
  On Mon, Mar 9, 2015 at 1:38 PM, Sean Owen so...@cloudera.com wrote:
    I'm +1 as I have not heard of anyone else seeing the Hive test
    failure, which is likely a test issue rather than a code issue anyway,
    and not a blocker.
  
   On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen so...@cloudera.com wrote:
   Although the problem is small, especially if indeed the essential docs
   changes are following just a couple days behind the final release, I
   mean, why the rush if they're essential? wait a couple days, finish
   them, make the release.
  
   Answer is, I think these changes aren't actually essential given the
   comment from tdas, so: just mark these Critical? (although ... they do
   say they're changes for the 1.3 release, so kind of funny to get to
   them for 1.3.x or 1.4, but that's not important now.)
  
    I thought that Blocker really meant Blocker in this project, as I've
    been encouraged to use it to mean "don't release without this." I
    think we should use it that way. Just thinking of it as "extra
    Critical" doesn't add anything. I don't think Documentation should be
    special-cased as less important, and I don't think there's confusion
    if Blocker means what it says, so I'd 'fix' it that way.
  
   If nobody sees the Hive failure I observed, and if we can just zap
   those Blockers one way or the other, +1
  
  
   On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com
  wrote:
   Sean,
  
   The docs are distributed and consumed in a fundamentally different
 way
   than Spark code itself. So we've always considered the deadline for
   doc changes to be when the release is finally posted.
  
   If there are small inconsistencies with the docs present in the
 source
   code for that release tag, IMO that doesn't matter much since we
 don't
   even distribute the docs with Spark's binary releases and virtually
 no
   one builds and hosts the docs on their own (that I am aware of, at
   least). Perhaps we can recommend if people want to build the doc
   sources that they should always grab the head of the most recent
   release branch, to set expectations accordingly.
  
   In the past we haven't considered it worth holding up the release
   process for the purpose of the docs. It just doesn't make sense since
   they are consumed as a service. If we decide to change this
   convention, it would mean shipping our releases later, since we
    couldn't pipeline the doc finalization with voting.
  
   - Patrick
  
   On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen so...@cloudera.com
 wrote:
   Given the title and tagging, it sounds like there could be some
   must-have doc changes to go with what is being released as 1.3. It
 can
   be finished later, and published later, but then the docs source
   shipped with the release doesn't match the site, and until then, 1.3
   is released without some must-have docs for 1.3 on the site.
  
   The real question to me is: are there any further, absolutely
   essential doc changes that need to accompany 1.3 or not?
  
   If not, just resolve these. If there are, then it seems like the
   release has to block on them. If there are some docs that should
 have
   gone in for 1.3, but didn't, but aren't essential, well I suppose it
   bears thinking about how to not slip as much work, but it doesn't
   block.
  
   I think Documentation issues certainly can be a blocker and
 shouldn't
   be specially ignored.
  
  
   BTW the UISeleniumSuite issue is a real failure, but I do not think
 it
   is serious: http://issues.apache.org/jira/browse/SPARK-6205  It
 isn't
   a regression from 1.2.x, but only affects tests, and only affects a
   subset of build profiles.
  
  
  
  
   On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell pwend...@gmail.com
 
  wrote:
   Hey Sean,
  
   SPARK-5310 Update SQL programming guide for 1.3
   SPARK-5183 Document data source API
   SPARK-6128 Update Spark Streaming Guide for Spark 1.3
  
    For these, the issue is that they are documentation JIRAs, which
   don't need to be timed exactly with the release vote, since we can
   update the documentation on the 

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted
(other than the implicit vote from Patrick). Was there a discussion offline
to cut an RC2? Was the vote extended?

On Mon, Feb 23, 2015 at 6:59 AM, Robin East robin.e...@xense.co.uk wrote:

 Running ec2 launch scripts gives me the following error:

 ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL
 routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

 Full stack trace at
 https://gist.github.com/insidedctm/4d41600bc22560540a26

 I’m running OSX Mavericks 10.9.5

 I’ll investigate further but wondered if anyone else has run into this.

 Robin


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
Thanks Sean. I glossed over the comment about SPARK-5669.

On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen so...@cloudera.com wrote:

 Yes, my understanding from Patrick's comment is that this RC will not
 be released, but that we should keep testing. There's an implicit -1 out of the
 gates there, I believe, and so the vote won't pass, so perhaps that's
 why there weren't further binding votes. I'm sure that will be
 formalized shortly.

 FWIW here are 10 issues still listed as blockers for 1.3.0:

 SPARK-5910 DataFrame.selectExpr(col as newName) does not work
 SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
 SPARK-5873 Can't see partially analyzed plans
 SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
 SPARK-5517 SPARK-5166 Add input types for Java UDFs
 SPARK-5463 Fix Parquet filter push-down
 SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
 SPARK-5183 SPARK-5180 Document data source API
 SPARK-3650 Triangle Count handles reverse edges incorrectly
 SPARK-3511 Create a RELEASE-NOTES.txt file in the repo


 On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet cjno...@gmail.com wrote:
  This vote was supposed to close on Saturday but it looks like no PMCs
 voted
  (other than the implicit vote from Patrick). Was there a discussion
 offline
  to cut an RC2? Was the vote extended?



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Corey Nolet
+1 (non-binding)

- Verified signatures using [1]
- Built on MacOSX Yosemite
- Built on Fedora 21

Each build was run with a Hadoop 2.4 version and the yarn, hive, and
hive-thriftserver profiles.

I am having trouble getting all the tests passing on a single run on both
machines but we have this same problem on other projects as well.

[1] https://github.com/cjnolet/nexus-staging-gpg-verify


On Wed, Feb 18, 2015 at 6:25 PM, Sean Owen so...@cloudera.com wrote:

 On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Patrick this link gives a 404:
  https://people.apache.org/keys/committer/pwendell.asc
 
  Works for me. Maybe it's some ephemeral issue?

  Yes, works now; I swear it didn't before! That's all set now. The
 signing key is in that file.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Replacing Jetty with TomCat

2015-02-17 Thread Corey Nolet
Niranda,

I'm not sure I'd say Spark's use of Jetty to expose its UI monitoring
layer constitutes a use of two web servers in a single product. Hadoop
uses Jetty, as do many other applications today that need an embedded
HTTP layer for serving their monitoring UI to users. This is completely
separate from any web container an application developer would use to interact
with Spark and Hadoop and serve domain-specific content to users. The two
are disjoint.

Many applications use Thrift as a means of establishing socket connections
between clients and across servers. One alternative to Thrift is Protobuf.
You wouldn't say "I want to swap out Thrift for Protobuf in Cassandra
because I want to use Protobuf in my application and there shouldn't be two
different socket-layer abstractions on my cluster."

I could understand wanting to do this if you were being forced to deploy a
war file to a web container in order to do the monitoring, but Spark's UI is
embedded within the code. If you are worried about having the Jetty
libraries on your classpath, you can exclude the Jetty dependencies from
your servlet code if you want to interact with a SparkContext in Tomcat.
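To sketch that last point (this is not from the original thread, and the
coordinates and version numbers are placeholders), an sbt dependency
declaration that keeps Spark's Jetty artifacts off a servlet application's
classpath might look roughly like:

  // build.sbt sketch; the spark-core version is a placeholder
  libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.2.1")
    .excludeAll(ExclusionRule(organization = "org.eclipse.jetty"))

  // note: with Jetty excluded you may also need to disable Spark's UI,
  // e.g. set spark.ui.enabled=false in your SparkConf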



On Tue, Feb 17, 2015 at 10:22 PM, Patrick Wendell pwend...@gmail.com
wrote:

 Hey Niranda,

 It seems to me a lot of effort to support multiple libraries inside of
 Spark like this, so I'm not sure that's a great solution.

 If you are building an application that embeds Spark, is it not
 possible for you to continue to use Jetty for Spark's internal servers
 and use Tomcat for your own servers? I would guess that many complex
 applications end up embedding multiple server libraries in various
 places (Spark itself has different transport mechanisms, etc.)

 - Patrick

 On Tue, Feb 17, 2015 at 7:14 PM, Niranda Perera
 niranda.per...@gmail.com wrote:
  Hi Sean,
   The main issue we have is running two web servers in a single product. We
   think it would not be an elegant solution.
  
   Could you please point me to the main areas where the Jetty server is tightly
   coupled, or to extension points where I could plug in Tomcat instead of Jetty?
   If successful, I could contribute it to the Spark project. :-)
 
  cheers
 
 
 
  On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen so...@cloudera.com wrote:
 
  There's no particular reason you have to remove the embedded Jetty
  server, right? It doesn't prevent you from using it inside another app
  that happens to run in Tomcat. You won't be able to switch it out
  without rewriting a fair bit of code, no, but you don't need to.
 
  On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera
  niranda.per...@gmail.com wrote:
   Hi,
  
   We are thinking of integrating the Spark server inside a product. Our
   current product uses Tomcat as its web server.
  
   Is it possible to switch the Jetty webserver in Spark to Tomcat
   off-the-shelf?
  
   Cheers
  
   --
   Niranda
 
 
 
 
  --
  Niranda

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Welcoming three new committers

2015-02-03 Thread Corey Nolet
Congrats guys!

On Tue, Feb 3, 2015 at 7:01 PM, Evan Chan velvia.git...@gmail.com wrote:

 Congrats everyone!!!

 On Tue, Feb 3, 2015 at 3:17 PM, Timothy Chen tnac...@gmail.com wrote:
  Congrats all!
 
  Tim
 
 
  On Feb 4, 2015, at 7:10 AM, Pritish Nawlakhe 
 prit...@nirvana-international.com wrote:
 
  Congrats and welcome back!!
 
 
 
  Thank you!!
 
  Regards
  Pritish
  Nirvana International Inc.
 
  Big Data, Hadoop, Oracle EBS and IT Solutions
  VA - SWaM, MD - MBE Certified Company
  prit...@nirvana-international.com
  http://www.nirvana-international.com
  Twitter: @nirvanainternat
 
  -Original Message-
  From: Hari Shreedharan [mailto:hshreedha...@cloudera.com]
  Sent: Tuesday, February 3, 2015 6:02 PM
  To: Ted Yu
  Cc: Nicholas Chammas; dev; Joseph Bradley; Cheng Lian; Matei Zaharia;
 Sean Owen
  Subject: Re: Welcoming three new committers
 
  Congrats Cheng, Joseph and Owen! Well done!
 
 
 
 
  Thanks, Hari
 
  On Tue, Feb 3, 2015 at 2:55 PM, Ted Yu yuzhih...@gmail.com wrote:
 
  Congratulations, Cheng, Joseph and Sean.
  On Tue, Feb 3, 2015 at 2:53 PM, Nicholas Chammas
  nicholas.cham...@gmail.com
  wrote:
  Congratulations guys!
 
  On Tue Feb 03 2015 at 2:36:12 PM Matei Zaharia
  matei.zaha...@gmail.com
  wrote:
 
  Hi all,
 
  The PMC recently voted to add three new committers: Cheng Lian,
  Joseph Bradley and Sean Owen. All three have been major
  contributors to Spark in the past year: Cheng on Spark SQL, Joseph
  on MLlib, and Sean on ML and
  many
  pieces throughout Spark Core. Join me in welcoming them as
 committers!
 
  Matei
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Spark SQL API changes and stabilization

2015-01-15 Thread Corey Nolet
Reynold,

One thing I'd like worked into the public portion of the API is the JSON
inference logic that creates a Set[(String, StructType)] out of a
Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators
to infer my schema instead of forcing a map/reduce phase to occur on an RDD
in order to get the final schema. Do you (or anyone else) see a path
forward in exposing this to users? A utility class perhaps?
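(To make the ask concrete, here is a rough sketch, with made-up field names
and data, of building a schema by hand from inferred (name, type) pairs and
applying it. It assumes Spark 1.3-style sql.types imports; applySchema is the
existing entry point, superseded by createDataFrame in 1.3:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.types._

  val sc = new SparkContext("local[2]", "schema-sketch")
  val sqlContext = new SQLContext(sc)

  // pretend these (name, type) pairs came out of some inference step
  val inferred = Seq(("name", StringType), ("age", IntegerType))
  val schema = StructType(inferred.map { case (n, t) => StructField(n, t, nullable = true) })

  // rows must line up with the schema above
  val rows = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))

  // applySchema builds a SchemaRDD from an RDD[Row] plus a StructType
  val schemaRDD = sqlContext.applySchema(rows, schema)
  schemaRDD.registerTempTable("people")
)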

On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin r...@databricks.com wrote:

 Alex,

  I didn't communicate properly. By "private," I simply meant the expectation
  that it is not a public API. The plan is to still omit it from the
  scaladoc/javadoc generation, but no language visibility modifier will be
  applied to them.

  After 1.3, you will likely no longer need to use things in the sql.catalyst
  package directly. Programmatically constructing SchemaRDDs is going to be a
  first-class public API. Data types have already been moved out of the
  sql.catalyst package and now live in sql.types. They are becoming stable
  public APIs. When the data frame patch is submitted, you will see a
  public expression library also. There will be few reasons for end users or
  library developers to hook into things in sql.catalyst. The bravest and
  the most advanced can still use them, with the expectation that they are
  subject to change.





 On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta alexbare...@gmail.com
 
 wrote:

  Reynold,
 
   Thanks for the heads up. In general, I strongly oppose the use of
   "private" to restrict access to certain parts of the API, the reason being
   that I might find the need to use some of the internals of a library from
   my own project. I find that a @DeveloperAPI annotation serves the same
   purpose as "private" without imposing unnecessary restrictions: it
   discourages people from using the annotated API and reserves the right for
   the core developers to change it suddenly in backwards-incompatible ways.
  
   In particular, I would like to express the desire that the APIs to
   programmatically construct SchemaRDDs from an RDD[Row] and a StructType
   remain public. All the Spark SQL data type objects should be exposed by the
   API, and the Jekyll build should not hide the docs as it does now.
 
  Thanks.
 
  Alex
 
  On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin r...@databricks.com
 wrote:
 
  Hi Spark devs,
 
  Given the growing number of developers that are building on Spark SQL,
 we
  would like to stabilize the API in 1.3 so users and developers can be
  confident to build on it. This also gives us a chance to improve the
 API.
 
  In particular, we are proposing the following major changes. This should
   have no impact on most users (i.e. those running SQL through the JDBC
  client or SQLContext.sql method).
 
  1. Everything in sql.catalyst package is private to the project.
 
  2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
  SchemaRDD and logical plans in order to construct test cases. We have
  received feedback from a lot of users that the DSL can be incredibly
  powerful. In 1.3, we’d like to refactor the DSL to make it suitable for
  not
  only constructing test cases, but also in everyday data pipelines. The
 new
  SchemaRDD API is inspired by the data frame concept in Pandas and R.
 
  3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose
 one
  set of APIs that will work for both Java and Scala. The current Java API
  (sql.api.java) does not share any common ancestor with the Scala API.
 This
  led to high maintenance burden for us as Spark developers and for
 library
   developers. We propose to eliminate the Java-specific API, and simply work
   on the existing Scala API to make it also usable for Java. This will make
   Java a first-class citizen like Scala. This effectively means that all
   public classes should be usable for both Scala and Java, including SQLContext,
  HiveContext, SchemaRDD, data types, and the aforementioned DSL.
 
 
  Again, this should have no impact on most users since the existing DSL
 is
  rarely used by end users. However, library developers might need to
 change
  the import statements because we are moving certain classes around. We
  will
  keep you posted as patches are merged.
 
 
 



Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Corey Nolet
I was actually about to post this myself: I have a complex join that could
benefit from something like a GroupComparator vs. having to do multiple
groupBy operations. This is probably the wrong thread for a full discussion
on this, but I didn't see a JIRA ticket for this or anything similar. Any
reasons why this would not make sense given Spark's design?
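For reference (this is not something from the thread, and the key and value
types are made up for illustration), the usual secondary-sort pattern in
Spark 1.2 uses a composite key, a partitioner on the grouping half of the
key, and repartitionAndSortWithinPartitions. A rough sketch:

  import org.apache.spark.{Partitioner, SparkContext}
  import org.apache.spark.SparkContext._

  // composite key: group by `id`, order by `ts` within each group
  case class EventKey(id: String, ts: Long)

  // partition on the grouping half of the key only
  class IdPartitioner(override val numPartitions: Int) extends Partitioner {
    override def getPartition(key: Any): Int = key match {
      case EventKey(id, _) => math.abs(id.hashCode % numPartitions)
    }
  }

  object SecondarySortSketch {
    // ordering picked up by repartitionAndSortWithinPartitions
    implicit val keyOrdering: Ordering[EventKey] =
      Ordering.by((k: EventKey) => (k.id, k.ts))

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local[2]", "secondary-sort-sketch")
      val events = sc.parallelize(Seq(
        (EventKey("a", 3L), "a3"), (EventKey("a", 1L), "a1"), (EventKey("b", 2L), "b2")))

      // every record for an id lands in the same partition, sorted by (id, ts),
      // without a separate groupBy plus per-group sort
      val sorted = events.repartitionAndSortWithinPartitions(new IdPartitioner(4))
      sorted.foreach(println)
      sc.stop()
    }
  }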

On Thu, Nov 20, 2014 at 9:39 AM, Madhu ma...@madhu.com wrote:

 Thanks Patrick.

 I've been testing some 1.2 features, looks good so far.
 I have some example code that I think will be helpful for certain MR-style
 use cases (secondary sort).
 Can I still add that to the 1.2 documentation, or is that frozen at this
 point?



 -
 --
 Madhu
 https://www.linkedin.com/in/msiddalingaiah
 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/ANNOUNCE-Spark-1-2-0-Release-Preview-Posted-tp9400p9449.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Spark Hadoop 2.5.1

2014-11-14 Thread Corey Nolet
I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
the current stable Hadoop 2.x, would it make sense for us to update the
poms?


Re: Spark Hadoop 2.5.1

2014-11-14 Thread Corey Nolet
In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like
you've mentioned. What prompted me to write this email was that I did not
see any documentation that told me Hadoop 2.5.1 was officially supported by
Spark (i.e. the community has been using it, any bugs are being fixed, etc.).
It builds and tests pass, but there could be other implications that I
have not run into based on my own use of the framework.

If we are saying that the standard procedure is to build with the
hadoop-2.4 profile and override the -Dhadoop.version property, should we
provide that on the build instructions [1] at least?

[1] http://spark.apache.org/docs/latest/building-with-maven.html
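For instance, roughly what such an instruction might say (mirroring the
hadoop-2.4 entry already in the Maven build docs, with 2.5.1 substituted in):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package

That would make it explicit that overriding hadoop.version is the supported path.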

On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote:

 I don't think it's necessary. You're looking at the hadoop-2.4
 profile, which works with anything >= 2.4. AFAIK there is no further
 specialization needed beyond that. The profile sets hadoop.version to
 2.4.0 by default, but this can be overridden.

 On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet cjno...@gmail.com wrote:
  I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
  the current stable Hadoop 2.x, would it make sense for us to update the
  poms?



Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
+1 (non-binding) [for original process proposal]

Greg, the first time I've seen the word "ownership" on this thread is in
your message. The first time the word "lead" has appeared in this thread is
in your message as well. I don't think that was the intent. The PMC and
committers have a responsibility to the community to make sure that their
patches are being reviewed and committed. I don't see anywhere in Apache's
recommended bylaws that says establishing responsibility on paper
for specific areas cannot be taken on by different members of the PMC.
What's been proposed looks, to me, to be an empirical process, and it looks
like it has pretty much a consensus from the side able to give binding
votes. I don't think this model establishes any form of ownership over
anything at all. I also don't see where the process proposal mentions that
nobody other than the persons responsible for a module can review or commit
code.

In fact, I'll go as far as to say that since Apache is a meritocracy, the
people who have been aligned to the responsibilities probably were aligned
based on some sort of merit, correct? Perhaps we could dig in and find out
for sure... I'm still getting familiar with the Spark community myself.



On Thu, Nov 6, 2014 at 7:28 PM, Patrick Wendell pwend...@gmail.com wrote:

 In fact, if you look at the Subversion committer list, the majority of
 people here have commit access only for particular areas of the
 project:

 http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS

 On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Greg,
 
  Regarding subversion - I think the reference is to partial vs full
  committers here:
  https://subversion.apache.org/docs/community-guide/roles.html
 
  - Patrick
 
  On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com wrote:
  -1 (non-binding)
 
   This is an idea that runs COMPLETELY counter to the Apache Way, and is
   to be severely frowned upon. This creates *unequal* ownership of the
   codebase.
  
   Each Member of the PMC should have *equal* rights to all areas of the
   codebase under their purview. It should not be subjected to others'
   ownership except through the standard mechanisms of reviews and,
   if/when absolutely necessary, vetoes.
 
  Apache does not want leads, benevolent dictators or assigned
  maintainers, no matter how you may dress it up with multiple
  maintainers per component. The fact is that this creates an unequal
  level of ownership and responsibility. The Board has shut down
  projects that attempted or allowed for Leads. Just a few months ago,
  there was a problem with somebody calling themself a Lead.
 
  I don't know why you suggest that Apache Subversion does this. We
  absolutely do not. Never have. Never will. The Subversion codebase is
  owned by all of us, and we all care for every line of it. Some people
  know more than others, of course. But any one of us, can change any
  part, without being subjected to a maintainer. Of course, we ask
  people with more knowledge of the component when we feel
  uncomfortable, but we also know when it is safe or not to make a
  specific change. And *always*, our fellow committers can review our
  work and let us know when we've done something wrong.
 
  Equal ownership reduces fiefdoms, enhances a feeling of community and
  project ownership, and creates a more open and inviting project.
 
  So again: -1 on this entire concept. Not good, to be polite.
 
  Regards,
  Greg Stein
  Director, Vice Chairman
  Apache Software Foundation
 
  On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
  Hi all,
 
  I wanted to share a discussion we've been having on the PMC list, as
 well as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed implementing a
 maintainer model for some of these components, similar to other large
 projects.
 
  As background on this, Spark has grown a lot since joining Apache.
 We've had over 80 contributors/month for the past 3 months, which I believe
 makes us the most active project in contributors/month at Apache, as well
 as over 500 patches/month. The codebase has also grown significantly, with
 new libraries for SQL, ML, graphs and more.
 
  In this kind of large project, one common way to scale development is
 to assign maintainers to oversee key components, where each patch to that
 component needs to get sign-off from at least one of its maintainers. Most
 existing large projects do this -- at Apache, some large ones with this
 model are CloudStack (the second-most active project overall), Subversion,
 and Kafka, and other examples include Linux and Python. This is also
 by-and-large how Spark operates today -- most components have a de-facto
 maintainer.
 
  IMO, adopting this model would have two 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
The PMC [1] is responsible for oversight and does not designate partial or full
committers. There are projects where all committers become PMC members and
others where the PMC is reserved for the committers with the most merit (and
willingness to take on the responsibility of project oversight, releases, etc.).
The community maintains the codebase through committers, who mentor,
roll in patches, and spread the project throughout other communities.

Adding someone's name to a list as a maintainer is not a barrier. With a
community as large as Spark's, and myself not being a committer on this
project, I see it as a welcome opportunity to find a mentor in the areas in
which I'm interested in contributing. We'd expect the list of names to grow
as more volunteers gain more interest, correct? To me, that seems quite
contrary to a barrier.

[1] http://www.apache.org/dev/pmc.html


On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 So I don't understand, Greg, are the partial committers committers, or are
 they not? Spark also has a PMC, but our PMC currently consists of all
 committers (we decided not to have a differentiation when we left the
 incubator). I see the Subversion partial committers listed as committers
 on https://people.apache.org/committers-by-project.html#subversion, so I
 assume they are committers. As far as I can see, CloudStack is similar.

 Matei

  On Nov 6, 2014, at 4:43 PM, Greg Stein gst...@gmail.com wrote:
 
  Partial committers are people invited to work on a particular area, and
 they do not require sign-off to work on that area. They can get a sign-off
 and commit outside that area. That approach doesn't compare to this
 proposal.
 
  Full committers are PMC members. As each PMC member is responsible for
 *every* line of code, then every PMC member should have complete rights to
 every line of code. Creating disparity flies in the face of a PMC member's
 responsibility. If I am a Spark PMC member, then I have responsibility for
 GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And
 interposing a barrier inhibits my responsibility to ensure GraphX is
 designed, maintained, and delivered to the Public.
 
  Cheers,
  -g
 
  (and yes, I'm aware of COMMITTERS; I've been changing that file for the
 past 12 years :-) )
 
  On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:
   In fact, if you look at the Subversion committer list, the majority of
  people here have commit access only for particular areas of the
  project:
 
  http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS 
 http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS
 
  On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell pwend...@gmail.com
 mailto:pwend...@gmail.com wrote:
   Hey Greg,
  
   Regarding subversion - I think the reference is to partial vs full
   committers here:
   https://subversion.apache.org/docs/community-guide/roles.html 
 https://subversion.apache.org/docs/community-guide/roles.html
  
   - Patrick
  
   On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein gst...@gmail.com mailto:
 gst...@gmail.com wrote:
   -1 (non-binding)
  
    This is an idea that runs COMPLETELY counter to the Apache Way, and is
    to be severely frowned upon. This creates *unequal* ownership of the
    codebase.
   
    Each Member of the PMC should have *equal* rights to all areas of the
    codebase under their purview. It should not be subjected to others'
    ownership except through the standard mechanisms of reviews and,
    if/when absolutely necessary, vetoes.
  
   Apache does not want leads, benevolent dictators or assigned
   maintainers, no matter how you may dress it up with multiple
   maintainers per component. The fact is that this creates an unequal
   level of ownership and responsibility. The Board has shut down
   projects that attempted or allowed for Leads. Just a few months ago,
   there was a problem with somebody calling themself a Lead.
  
   I don't know why you suggest that Apache Subversion does this. We
   absolutely do not. Never have. Never will. The Subversion codebase is
   owned by all of us, and we all care for every line of it. Some people
   know more than others, of course. But any one of us, can change any
   part, without being subjected to a maintainer. Of course, we ask
   people with more knowledge of the component when we feel
   uncomfortable, but we also know when it is safe or not to make a
   specific change. And *always*, our fellow committers can review our
   work and let us know when we've done something wrong.
  
   Equal ownership reduces fiefdoms, enhances a feeling of community and
   project ownership, and creates a more open and inviting project.
  
   So again: -1 on this entire concept. Not good, to be polite.
  
   Regards,
   Greg Stein
   Director, Vice Chairman
   Apache Software Foundation
  
   On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
   Hi all,
  
   I wanted to share a 

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
I'm actually going to change my non-binding to +0 for the proposal as-is.

I overlooked some parts of the original proposal that, when reading over
them again, do not sit well with me. "One of the maintainers needs to sign
off on each patch to the component," as Greg has pointed out, does seem to
imply that there are committers with more power than others with regard to
specific components, which does imply ownership.

My thinking would be to re-work it in some way so as to take the accent off
ownership. I would maybe focus on things such as:

1) Other committers and contributors being forced to consult with
maintainers of modules before patches can get rolled in.
2) Maintainers being assigned specifically from PMC.
3) Oversight to have more accent on keeping the community happy in a
specific area of interest vice being a consultant for the design of a
specific piece.

On Thu, Nov 6, 2014 at 8:46 PM, Arun C Murthy a...@hortonworks.com wrote:

 With my ASF Member hat on, I fully agree with Greg.

 As he points out, this is an anti-pattern in the ASF and is severely
 frowned upon.

  We, in Hadoop, had a similar trajectory where we were politely told to
  go away from having sub-project committers (HDFS, MapReduce etc.) to a
  common list of committers. There were some concerns initially, but we have
  successfully managed to work together and build a healthier community as
  a result of following the advice on the ASF Way.

 I do have sympathy for good oversight etc. as the project grows and
 attracts many contributors - it's essentially the need to have smaller,
 well-knit developer communities. One way to achieve that would be to have
 separate TLPs  (e.g. Spark, MLLIB, GraphX) with separate committer lists
 for each representing the appropriate community. Hadoop went a similar
 route where we had Pig, Hive, HBase etc. as sub-projects initially and then
 split them into TLPs with more focussed communities to the benefit of
 everyone. Maybe you guys want to try this too?

 

 Few more observations:
 # In general, *discussions* on project directions (such as new concept of
 *maintainers*) should happen first on the public lists *before* voting, not
 in the private PMC list.
  # If you choose to go this route in spite of this advice, it seems to me Spark
  would be better off having more maintainers per component (at least 4-5),
 probably with a lot more diversity in terms of affiliations. Not sure if
 that is a concern - do you have good diversity in the proposed list? This
 will ensure that there are no concerns about a dominant employer
 controlling a project.

 

  Hope this helps - we've gone through a similar journey, got through similar
  issues, and fully embraced the Apache Way (™), as Greg points out, to our
 benefit.

 thanks,
 Arun


 On Nov 6, 2014, at 4:18 PM, Greg Stein gst...@gmail.com wrote:

  -1 (non-binding)
 
   This is an idea that runs COMPLETELY counter to the Apache Way, and is
   to be severely frowned upon. This creates *unequal* ownership of the
   codebase.
  
   Each Member of the PMC should have *equal* rights to all areas of the
   codebase under their purview. It should not be subjected to others'
   ownership except through the standard mechanisms of reviews and,
   if/when absolutely necessary, vetoes.
 
  Apache does not want leads, benevolent dictators or assigned
  maintainers, no matter how you may dress it up with multiple
  maintainers per component. The fact is that this creates an unequal
  level of ownership and responsibility. The Board has shut down
  projects that attempted or allowed for Leads. Just a few months ago,
  there was a problem with somebody calling themself a Lead.
 
  I don't know why you suggest that Apache Subversion does this. We
  absolutely do not. Never have. Never will. The Subversion codebase is
  owned by all of us, and we all care for every line of it. Some people
  know more than others, of course. But any one of us, can change any
  part, without being subjected to a maintainer. Of course, we ask
  people with more knowledge of the component when we feel
  uncomfortable, but we also know when it is safe or not to make a
  specific change. And *always*, our fellow committers can review our
  work and let us know when we've done something wrong.
 
  Equal ownership reduces fiefdoms, enhances a feeling of community and
  project ownership, and creates a more open and inviting project.
 
  So again: -1 on this entire concept. Not good, to be polite.
 
  Regards,
  Greg Stein
  Director, Vice Chairman
  Apache Software Foundation
 
  On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
  Hi all,
 
  I wanted to share a discussion we've been having on the PMC list, as
 well as call for an official vote on it on a public list. Basically, as the
 Spark project scales up, we need to define a model to make sure there is
 still great oversight of key components (in particular internal
 architecture and public APIs), and to this end I've proposed 

Re: Raise Java dependency from 6 to 7

2014-10-19 Thread Corey Nolet
A concrete plan and a definite version upon which the upgrade would be
applied sound like they would benefit the community. If you plan far enough
out (as Hadoop has done) and give the community enough notice, I can't
see it being a problem, as they would have ample time to upgrade.



On Sat, Oct 18, 2014 at 9:20 PM, Marcelo Vanzin van...@cloudera.com wrote:

  Hadoop, for better or worse, depends on an ancient version of Jetty
  (6), which is even in a different package. So Spark (or anyone trying
  to use a newer Jetty) is lucky on that front...

 IIRC Hadoop is planning to move to Java 7-only starting with 2.7. Java
 7 is also supposed to be EOL some time next year, so a plan to move to
 Java 7 and, eventually, Java 8 would be nice.

 On Sat, Oct 18, 2014 at 5:44 PM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
  I'd also wait a bit until these are gone. Jetty is unfortunately a much
 hairier topic by the way, because the Hadoop libraries also depend on
 Jetty. I think it will be hard to update. However, a patch that shades
 Jetty might be nice to have, if that doesn't require shading a lot of other
 stuff.
 
  Matei
 
  On Oct 18, 2014, at 4:37 PM, Koert Kuipers ko...@tresata.com wrote:
 
   My experience is that there are still a lot of Java 6 clusters out
   there. Also, distros that bundle Spark still support Java 6.
  On Oct 17, 2014 8:01 PM, Andrew Ash and...@andrewash.com wrote:
 
  Hi Spark devs,
 
  I've heard a few times that keeping support for Java 6 is a priority
 for
  Apache Spark.  Given that Java 6 has been publicly EOL'd since Feb 2013
  http://www.oracle.com/technetwork/java/eol-135779.html and the last
  public update was Apr 2013
  https://en.wikipedia.org/wiki/Java_version_history#Java_6_updates,
 why
   are we still maintaining support for 6? The only people using it now
 must be
  paying for the extended support to continue receiving security fixes.
 
  Bumping the lower bound of Java versions up to Java 7 would allow us to
  upgrade from Jetty 8 to 9, which is currently a conflict with the
  Dropwizard framework and a personal pain point.
 
  Java 6 vs 7 for Spark links:
  Try with resources
  https://github.com/apache/spark/pull/2575/files#r18152125 for
  SparkContext et al
  Upgrade to Jetty 9
  https://github.com/apache/spark/pull/167#issuecomment-54544494
  Warn when not compiling with Java6
  https://github.com/apache/spark/pull/859
 
 
  Who are the people out there that still need Java 6 support?
 
  Thanks!
  Andrew
 
 
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 



 --
 Marcelo

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org