Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David,

Thank you very much for announcing this! It looks like it could be very
useful. Would you mind providing a link to the GitHub repo?

On Tue, Jan 12, 2016 at 10:03 AM, David 
wrote:

> Hi all,
>
> I'd like to share news of the recent release of a new Spark package, ROSE.
>
>
> ROSE is a Scala library offering access to the full scientific computing
> power of the R programming language to Apache Spark batch and streaming
> applications on the JVM. Where Apache SparkR lets data scientists use Spark
> from R, ROSE is designed to let Scala and Java developers use R from Spark.
>
> The project is available and documented on GitHub and I would encourage
> you to take a look. Any feedback, questions etc very welcome.
>
> David
>
> "All that is gold does not glitter, Not all those who wander are lost."
>


Re: Welcoming two new committers

2016-02-08 Thread Corey Nolet
Congrats guys!

On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu  wrote:

> Congratulations, Herman and Wenchen.
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van Hovell
>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>> adding new features, optimizations and APIs. Please join me in welcoming
>> Herman and Wenchen.
>>
>> Matei
>>
>>
>


Re: [ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-20 Thread Corey Nolet
I was actually about to post this myself- I have a complex join that could
benefit from something like a GroupComparator vs. having to do multiple
groupBy operations. This is probably the wrong thread for a full discussion,
but I didn't see a JIRA ticket for this or anything similar- are there any
reasons why this would not make sense given Spark's design?
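
For concreteness, here is roughly the kind of thing I have in mind, sketched
with repartitionAndSortWithinPartitions (which I believe is new in 1.2) and
purely hypothetical key/value types:

    import org.apache.spark.{Partitioner, SparkConf, SparkContext}
    import org.apache.spark.SparkContext._

    // Hypothetical composite key: group by `id`, sort within each group by `ts`.
    case class EventKey(id: String, ts: Long)

    // Partition only on the grouping half of the key so that all records for
    // an id land in the same partition.
    class IdPartitioner(override val numPartitions: Int) extends Partitioner {
      override def getPartition(key: Any): Int = key match {
        case EventKey(id, _) => math.abs(id.hashCode % numPartitions)
      }
    }

    object SecondarySortSketch {
      // An ordering on the full key plays the role of the GroupComparator:
      // within a partition, records arrive grouped by id and sorted by ts.
      implicit val keyOrdering: Ordering[EventKey] =
        Ordering.by[EventKey, (String, Long)](k => (k.id, k.ts))

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("secondary-sort").setMaster("local[2]"))

        val events = sc.parallelize(Seq(
          (EventKey("a", 3L), "x"), (EventKey("a", 1L), "y"), (EventKey("b", 2L), "z")))

        events.repartitionAndSortWithinPartitions(new IdPartitioner(2))
          .mapPartitions(_.map { case (k, v) => s"${k.id} ${k.ts} $v" })
          .collect()
          .foreach(println)

        sc.stop()
      }
    }

This avoids the extra groupBy, but a first-class GroupComparator-style API
would still make the pattern much more obvious.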

On Thu, Nov 20, 2014 at 9:39 AM, Madhu  wrote:

> Thanks Patrick.
>
> I've been testing some 1.2 features, looks good so far.
> I have some example code that I think will be helpful for certain MR-style
> use cases (secondary sort).
> Can I still add that to the 1.2 documentation, or is that frozen at this
> point?
>
>
>
> -
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
>
>
>


Re: Spark SQL API changes and stabilization

2015-01-15 Thread Corey Nolet
Reynold,

One thing I'd like worked into the public portion of the API is the JSON
inference logic that creates a Set[(String, StructType)] out of a
Map[String, Any]. SPARK-5260 addresses this so that I can use Accumulators
to infer my schema instead of forcing a separate map/reduce phase over the
RDD just to get the final schema. Do you (or anyone else) see a path
forward in exposing this to users? A utility class, perhaps?
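
To illustrate the idea (only a toy sketch, not the SPARK-5260 API- plain
string type tags stand in for the real StructType handling, and the merge
rule is deliberately simplistic):

    import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

    // Merges per-record field summaries; conflicting types widen to "string".
    object SchemaParam extends AccumulatorParam[Map[String, String]] {
      def zero(initial: Map[String, String]): Map[String, String] = Map.empty
      def addInPlace(a: Map[String, String], b: Map[String, String]): Map[String, String] =
        (a.keySet ++ b.keySet).map { k =>
          val merged = (a.get(k), b.get(k)) match {
            case (Some(x), Some(y)) if x == y => x
            case (Some(x), None)              => x
            case (None, Some(y))              => y
            case _                            => "string"
          }
          k -> merged
        }.toMap
    }

    object InferSchemaWhileProcessing {
      private def typeTag(v: Any): String = v match {
        case _: Int | _: Long     => "long"
        case _: Float | _: Double => "double"
        case _: Boolean           => "boolean"
        case _                    => "string"
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("infer-schema").setMaster("local[2]"))
        val records = sc.parallelize(Seq[Map[String, Any]](
          Map("id" -> 1, "name" -> "a"),
          Map("id" -> 2, "score" -> 0.5)))

        val schemaAcc = sc.accumulator(Map.empty[String, String])(SchemaParam)

        // The field/type summary piggybacks on an action the job runs anyway,
        // rather than requiring a dedicated map/reduce pass over the data.
        records.foreach(rec => schemaAcc += rec.mapValues(typeTag).toMap)

        println(schemaAcc.value) // e.g. Map(id -> long, name -> string, score -> double)
        sc.stop()
      }
    }

Exposing the real inference/merge logic as a utility class would let this kind
of thing build actual StructTypes instead.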

On Thu, Jan 15, 2015 at 1:33 PM, Reynold Xin  wrote:

> Alex,
>
> I didn't communicate properly. By "private", I simply meant the expectation
> that it is not a public API. The plan is still to omit it from the
> scaladoc/javadoc generation, but no language visibility modifier will be
> applied to it.
>
> After 1.3, you will likely no longer need to use things in the sql.catalyst
> package directly. Programmatically constructing SchemaRDDs is going to be a
> first-class public API. Data types have already been moved out of the
> sql.catalyst package and now live in sql.types. They are becoming stable
> public APIs. When the "data frame" patch is submitted, you will also see a
> public expression library. There will be little reason for end users or
> library developers to hook into things in sql.catalyst. The bravest and
> most advanced can still use them, with the expectation that they are
> subject to change.
>
>
>
>
>
> On Thu, Jan 15, 2015 at 7:53 AM, Alessandro Baretta  >
> wrote:
>
> > Reynold,
> >
> > Thanks for the heads up. In general, I strongly oppose the use of
> > "private" to restrict access to certain parts of the API, the reason
> > being that I might find the need to use some of the internals of a
> > library from my own project. I find that a @DeveloperAPI annotation
> > serves the same purpose as "private" without imposing unnecessary
> > restrictions: it discourages people from using the annotated API and
> > reserves the right for the core developers to change it suddenly in
> > backwards-incompatible ways.
> >
> > In particular, I would like to express the desire that the APIs to
> > programmatically construct SchemaRDDs from an RDD[Row] and a StructType
> > remain public. All the SparkSQL data type objects should be exposed by
> > the API, and the jekyll build should not hide the docs as it does now.
> >
> > Thanks.
> >
> > Alex
> >
> > On Wed, Jan 14, 2015 at 9:45 PM, Reynold Xin 
> wrote:
> >
> >> Hi Spark devs,
> >>
> >> Given the growing number of developers that are building on Spark SQL,
> >> we would like to stabilize the API in 1.3 so users and developers can
> >> build on it with confidence. This also gives us a chance to improve the
> >> API.
> >>
> >> In particular, we are proposing the following major changes. This should
> >> have no impact for most users (i.e. those running SQL through the JDBC
> >> client or SQLContext.sql method).
> >>
> >> 1. Everything in sql.catalyst package is private to the project.
> >>
> >> 2. Redesign SchemaRDD DSL (SPARK-5097): We initially added the DSL for
> >> SchemaRDD and logical plans in order to construct test cases. We have
> >> received feedback from a lot of users that the DSL can be incredibly
> >> powerful. In 1.3, we'd like to refactor the DSL to make it suitable not
> >> only for constructing test cases, but also for everyday data pipelines.
> >> The new SchemaRDD API is inspired by the data frame concept in Pandas
> >> and R.
> >>
> >> 3. Reconcile Java and Scala APIs (SPARK-5193): We would like to expose
> >> one set of APIs that will work for both Java and Scala. The current Java
> >> API (sql.api.java) does not share any common ancestor with the Scala
> >> API. This has led to a high maintenance burden for us as Spark
> >> developers and for library developers. We propose to eliminate the
> >> Java-specific API and simply work on the existing Scala API to make it
> >> also usable from Java. This will make Java a first-class citizen
> >> alongside Scala. This effectively means that all public classes should
> >> be usable from both Scala and Java, including SQLContext, HiveContext,
> >> SchemaRDD, data types, and the aforementioned DSL.
> >>
> >>
> >> Again, this should have no impact on most users since the existing DSL
> >> is rarely used by end users. However, library developers might need to
> >> change the import statements because we are moving certain classes
> >> around. We will keep you posted as patches are merged.
> >>
> >
> >
>
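
For reference, the programmatic construction discussed above ends up looking
roughly like this under the 1.3-era API (createDataFrame on SQLContext, with
the types now living in sql.types); the column names and sample data here are
purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object ProgrammaticSchemaSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("programmatic-schema").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)

        // Build the schema explicitly rather than inferring it.
        val schema = StructType(Seq(
          StructField("name", StringType, nullable = true),
          StructField("age", IntegerType, nullable = true)))

        val rows = sc.parallelize(Seq(Row("alice", 34), Row("bob", 28)))

        // RDD[Row] + StructType -> DataFrame (a SchemaRDD in pre-1.3 terms).
        val people = sqlContext.createDataFrame(rows, schema)
        people.registerTempTable("people")
        sqlContext.sql("SELECT name FROM people WHERE age > 30").show()

        sc.stop()
      }
    }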


Re: Welcoming three new committers

2015-02-03 Thread Corey Nolet
Congrats guys!

On Tue, Feb 3, 2015 at 7:01 PM, Evan Chan  wrote:

> Congrats everyone!!!
>
> On Tue, Feb 3, 2015 at 3:17 PM, Timothy Chen  wrote:
> > Congrats all!
> >
> > Tim
> >
> >
> >> On Feb 4, 2015, at 7:10 AM, Pritish Nawlakhe <
> prit...@nirvana-international.com> wrote:
> >>
> >> Congrats and welcome back!!
> >>
> >>
> >>
> >> Thank you!!
> >>
> >> Regards
> >> Pritish
> >> Nirvana International Inc.
> >>
> >> Big Data, Hadoop, Oracle EBS and IT Solutions
> >> VA - SWaM, MD - MBE Certified Company
> >> prit...@nirvana-international.com
> >> http://www.nirvana-international.com
> >> Twitter: @nirvanainternat
> >>
> >> -Original Message-
> >> From: Hari Shreedharan [mailto:hshreedha...@cloudera.com]
> >> Sent: Tuesday, February 3, 2015 6:02 PM
> >> To: Ted Yu
> >> Cc: Nicholas Chammas; dev; Joseph Bradley; Cheng Lian; Matei Zaharia;
> Sean Owen
> >> Subject: Re: Welcoming three new committers
> >>
> >> Congrats Cheng, Joseph and Owen! Well done!
> >>
> >>
> >>
> >>
> >> Thanks, Hari
> >>
> >>> On Tue, Feb 3, 2015 at 2:55 PM, Ted Yu  wrote:
> >>>
> >>> Congratulations, Cheng, Joseph and Sean.
> >>> On Tue, Feb 3, 2015 at 2:53 PM, Nicholas Chammas
> >>>   wrote:
>  Congratulations guys!
> 
>  On Tue Feb 03 2015 at 2:36:12 PM Matei Zaharia
>  
>  wrote:
> 
> > Hi all,
> >
> > The PMC recently voted to add three new committers: Cheng Lian,
> > Joseph Bradley and Sean Owen. All three have been major
> > contributors to Spark in the past year: Cheng on Spark SQL, Joseph
> > on MLlib, and Sean on ML and
>  many
> > pieces throughout Spark Core. Join me in welcoming them as
> committers!
> >
> > Matei
> >>
> >>
> >>
> >
> >
>
>
>


Re: Replacing Jetty with TomCat

2015-02-17 Thread Corey Nolet
Niranda,

I'm not sure I'd say Spark's use of Jetty to expose its UI monitoring layer
constitutes "two web servers in a single product". Hadoop uses Jetty, as do
many other applications today that need an embedded HTTP layer to serve their
monitoring UI to users. This is completely separate from any web container an
application developer would use to interact with Spark and Hadoop and to
serve domain-specific content to users. The two are disjoint.

Many applications use Thrift as a means of establishing socket connections
between clients and across servers. One alternative to Thrift is Protobuf.
You wouldn't say "I want to swap out thrift for protobuf in Cassandra
because I want to use protobuf in my application and there shouldn't be two
different socket layer abstractions on my cluster."

I could understand wanting to do this if you were being forced to deploy a
war file to a web container in order to do the monitoring, but Spark's UI is
embedded within the code. If you are worried about having the Jetty
libraries on your classpath, you can exclude the Jetty dependencies from
your servlet project if you want to interact with a SparkContext in Tomcat.
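
For example, with sbt the exclusions would look something like this (a
sketch- the Spark version is a placeholder, the exact Jetty artifacts pulled
in vary by release, and with Jetty excluded you would likely also want to set
spark.ui.enabled=false):

    // build.sbt
    libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.2.1")
      .exclude("org.eclipse.jetty", "jetty-server")
      .exclude("org.eclipse.jetty", "jetty-servlet")
      .exclude("org.eclipse.jetty", "jetty-util")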



On Tue, Feb 17, 2015 at 10:22 PM, Patrick Wendell 
wrote:

> Hey Niranda,
>
> It seems to me a lot of effort to support multiple libraries inside of
> Spark like this, so I'm not sure that's a great solution.
>
> If you are building an application that embeds Spark, is it not
> possible for you to continue to use Jetty for Spark's internal servers
> and use tomcat for your own server's? I would guess that many complex
> applications end up embedding multiple server libraries in various
> places (Spark itself has different transport mechanisms, etc.)
>
> - Patrick
>
> On Tue, Feb 17, 2015 at 7:14 PM, Niranda Perera
>  wrote:
> > Hi Sean,
> > The main issue we have is running two web servers in a single product;
> > we think it would not be an elegant solution.
> >
> > Could you please point me to the main areas where jetty server is tightly
> > coupled or extension points where I could plug tomcat instead of jetty?
> > If successful I could contribute it to the spark project. :-)
> >
> > cheers
> >
> >
> >
> > On Mon, Feb 16, 2015 at 4:51 PM, Sean Owen  wrote:
> >
> >> There's no particular reason you have to remove the embedded Jetty
> >> server, right? it doesn't prevent you from using it inside another app
> >> that happens to run in Tomcat. You won't be able to switch it out
> >> without rewriting a fair bit of code, no, but you don't need to.
> >>
> >> On Mon, Feb 16, 2015 at 5:08 AM, Niranda Perera
> >>  wrote:
> >> > Hi,
> >> >
> >> > We are thinking of integrating Spark server inside a product. Our
> current
> >> > product uses Tomcat as its webserver.
> >> >
> >> > Is it possible to switch the Jetty webserver in Spark to Tomcat
> >> > off-the-shelf?
> >> >
> >> > Cheers
> >> >
> >> > --
> >> > Niranda
> >>
> >
> >
> >
> > --
> > Niranda
>
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Corey Nolet
+1 (non-binding)

- Verified signatures using [1]
- Built on MacOSX Yosemite
- Built on Fedora 21

Each build was run with a Hadoop 2.4 version and the yarn, hive, and
hive-thriftserver profiles.

I am having trouble getting all the tests passing on a single run on both
machines but we have this same problem on other projects as well.

[1] https://github.com/cjnolet/nexus-staging-gpg-verify


On Wed, Feb 18, 2015 at 6:25 PM, Sean Owen  wrote:

> On Wed, Feb 18, 2015 at 6:13 PM, Patrick Wendell 
> wrote:
> >> Patrick this link gives a 404:
> >> https://people.apache.org/keys/committer/pwendell.asc
> >
> > Works for me. Maybe it's some ephemeral issue?
>
> Yes works now; I swear it didn't before! that's all set now. The
> signing key is in that file.
>
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted
(other than the implicit vote from Patrick). Was there a discussion offline
to cut an RC2? Was the vote extended?

On Mon, Feb 23, 2015 at 6:59 AM, Robin East  wrote:

> Running ec2 launch scripts gives me the following error:
>
> ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL
> routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
>
> Full stack trace at
> https://gist.github.com/insidedctm/4d41600bc22560540a26
>
> I’m running OSX Mavericks 10.9.5
>
> I’ll investigate further but wondered if anyone else has run into this.
>
> Robin


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
Thanks Sean. I glossed over the comment about SPARK-5669.

On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen  wrote:

> Yes my understanding from Patrick's comment is that this RC will not
> be released, but, to keep testing. There's an implicit -1 out of the
> gates there, I believe, and so the vote won't pass, so perhaps that's
> why there weren't further binding votes. I'm sure that will be
> formalized shortly.
>
> FWIW here are 10 issues still listed as blockers for 1.3.0:
>
> SPARK-5910 DataFrame.selectExpr("col as newName") does not work
> SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
> SPARK-5873 Can't see partially analyzed plans
> SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
> SPARK-5517 SPARK-5166 Add input types for Java UDFs
> SPARK-5463 Fix Parquet filter push-down
> SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
> SPARK-5183 SPARK-5180 Document data source API
> SPARK-3650 Triangle Count handles reverse edges incorrectly
> SPARK-3511 Create a RELEASE-NOTES.txt file in the repo
>
>
> On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet  wrote:
> > This vote was supposed to close on Saturday but it looks like no PMCs
> voted
> > (other than the implicit vote from Patrick). Was there a discussion
> offline
> > to cut an RC2? Was the vote extended?
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Corey Nolet
+1 (non-binding)

- Verified signatures
- Built on Mac OS X and Fedora 21.

On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar  wrote:

> Excellent, Thanks Xiangrui. The mystery is solved.
> Cheers
> 
>
>
> On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng  wrote:
>
> > Krishna, I tested your linear regression example. For linear
> > regression, we changed its objective function from 1/n * \|Ax - b\|_2^2
> > to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least-squares
> > formulations. It means you could reproduce the same result by
> > multiplying the step size by 2. This is not a problem if both runs go
> > until convergence (and don't blow up). However, in your example, a very
> > small step size is chosen and it didn't converge in 100 iterations. In
> > this case, the step size matters. I will put a note in the migration
> > guide. Thanks! -Xiangrui
> >
> > On Mon, Mar 9, 2015 at 1:38 PM, Sean Owen  wrote:
> > > I'm +1 as I have not heard of anyone else seeing the Hive test
> > > failure, which is likely a test issue rather than a code issue anyway,
> > > and not a blocker.
> > >
> > > On Fri, Mar 6, 2015 at 9:36 PM, Sean Owen  wrote:
> > >> Although the problem is small, especially if indeed the essential docs
> > >> changes are following just a couple days behind the final release, I
> > >> mean, why the rush if they're essential? wait a couple days, finish
> > >> them, make the release.
> > >>
> > >> Answer is, I think these changes aren't actually essential given the
> > >> comment from tdas, so: just mark these Critical? (although ... they do
> > >> say they're changes for the 1.3 release, so kind of funny to get to
> > >> them for 1.3.x or 1.4, but that's not important now.)
> > >>
> > >> I thought that Blocker really meant Blocker in this project, as I've
> > >> been encouraged to use it to mean "don't release without this." I
> > >> think we should use it that way. Just thinking of it as "extra
> > >> Critical" doesn't add anything. I don't think Documentation should be
> > >> special-cased as less important, and I don't think there's confusion
> > >> if Blocker means what it says, so I'd 'fix' that way.
> > >>
> > >> If nobody sees the Hive failure I observed, and if we can just zap
> > >> those "Blockers" one way or the other, +1
> > >>
> > >>
> > >> On Fri, Mar 6, 2015 at 9:17 PM, Patrick Wendell 
> > wrote:
> > >>> Sean,
> > >>>
> > >>> The docs are distributed and consumed in a fundamentally different
> way
> > >>> than Spark code itself. So we've always considered the "deadline" for
> > >>> doc changes to be when the release is finally posted.
> > >>>
> > >>> If there are small inconsistencies with the docs present in the
> source
> > >>> code for that release tag, IMO that doesn't matter much since we
> don't
> > >>> even distribute the docs with Spark's binary releases and virtually
> no
> > >>> one builds and hosts the docs on their own (that I am aware of, at
> > >>> least). Perhaps we can recommend if people want to build the doc
> > >>> sources that they should always grab the head of the most recent
> > >>> release branch, to set expectations accordingly.
> > >>>
> > >>> In the past we haven't considered it worth holding up the release
> > >>> process for the purpose of the docs. It just doesn't make sense since
> > >>> they are consumed "as a service". If we decide to change this
> > >>> convention, it would mean shipping our releases later, since we
> > >>> couldn't pipeline the doc finalization with voting.
> > >>>
> > >>> - Patrick
> > >>>
> > >>> On Fri, Mar 6, 2015 at 11:02 AM, Sean Owen 
> wrote:
> >  Given the title and tagging, it sounds like there could be some
> >  must-have doc changes to go with what is being released as 1.3. It
> can
> >  be finished later, and published later, but then the docs source
> >  shipped with the release doesn't match the site, and until then, 1.3
> >  is released without some "must-have" docs for 1.3 on the site.
> > 
> >  The real question to me is: are there any further, absolutely
> >  essential doc changes that need to accompany 1.3 or not?
> > 
> >  If not, just resolve these. If there are, then it seems like the
> >  release has to block on them. If there are some docs that should
> have
> >  gone in for 1.3, but didn't, but aren't essential, well I suppose it
> >  bears thinking about how to not slip as much work, but it doesn't
> >  block.
> > 
> >  I think Documentation issues certainly can be a blocker and
> shouldn't
> >  be specially ignored.
> > 
> > 
> >  BTW the UISeleniumSuite issue is a real failure, but I do not think
> it
> >  is serious: http://issues.apache.org/jira/browse/SPARK-6205  It
> isn't
> >  a regression from 1.2.x, but only affects tests, and only affects a
> >  subset of build profiles.
> > 
> > 
> > 
> > 
> >  On Fri, Mar 6, 2015 at 6:43 PM, Patrick Wendell  >
> > wrote:
> > > Hey Sean,
> > >
>

Re: Welcoming some new committers

2015-06-20 Thread Corey Nolet
Congrats guys! Keep up the awesome work!

On Sat, Jun 20, 2015 at 3:28 PM, Guru Medasani  wrote:

> Congratulations to all the new committers!
>
> Guru Medasani
> gdm...@gmail.com
>
>
>
> > On Jun 17, 2015, at 5:12 PM, Matei Zaharia 
> wrote:
> >
> > Hey all,
> >
> > Over the past 1.5 months we added a number of new committers to the
> project, and I wanted to welcome them now that all of their respective
> forms, accounts, etc are in. Join me in welcoming the following new
> committers:
> >
> > - Davies Liu
> > - DB Tsai
> > - Kousuke Saruta
> > - Sandy Ryza
> > - Yin Huai
> >
> > Looking forward to more great contributions from all of these folks.
> >
> > Matei
> >
>
>
>
>


Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client
API, so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on the nodes containing the
query results- which means you can only assume most results will be sent over
the network to the task that needs to process them. This is bad. The other
reason (also related to locality) is that I'm not sure there's an easy way to
spread the results of a query over multiple different clients- thus you'd
probably have to start your Spark RDD with a single partition and then
repartition. At that point you've taken data from multiple MongoDB nodes and
collected it on a single node, just to re-partition it, again across the
network, onto multiple nodes. This is also bad.

I think this is the reason it was recommended to use MongoDB's map-reduce:
it can use its locality information internally. I had this same issue with
Couchbase a couple of years back- it's unfortunate, but it's the reality.




On Fri, Sep 11, 2015 at 9:34 AM, Sandeep Giri 
wrote:

> I think it should be possible by loading collections as RDD and then doing
> a union on them.
>
> Regards,
> Sandeep Giri,
> +1 347 781 4573 (US)
> +91-953-899-8962 (IN)
>
> www.KnowBigData.com. 
> Phone: +1-253-397-1945 (Office)
>
>
>
> On Fri, Sep 11, 2015 at 3:40 PM, Mishra, Abhishek <
> abhishek.mis...@xerox.com> wrote:
>
>> Anything using Spark RDD’s ???
>>
>>
>>
>> Abhishek
>>
>>
>>
>> *From:* Sandeep Giri [mailto:sand...@knowbigdata.com]
>> *Sent:* Friday, September 11, 2015 3:19 PM
>> *To:* Mishra, Abhishek; u...@spark.apache.org; dev@spark.apache.org
>> *Subject:* Re: MongoDB and Spark
>>
>>
>>
>> use map-reduce.
>>
>>
>>
>> On Fri, Sep 11, 2015, 14:32 Mishra, Abhishek 
>> wrote:
>>
>> Hello ,
>>
>>
>>
>> Is there any way to query multiple collections from MongoDB using Spark
>> and Java? And I want to create only one Configuration object. Please help
>> if anyone has something regarding this.
>>
>>
>>
>>
>>
>> Thank You
>>
>> Abhishek
>>
>>
>


Re: Forecasting Library For Apache Spark

2015-09-21 Thread Corey Nolet
Mohamed,

Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
was added to this recently and seasonal ARIMA should be following shortly.

[1] https://github.com/cloudera/spark-timeseries
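
For reference, fitting a non-seasonal ARIMA model with that library looks
roughly like the following- this is from memory of its README, so the package
path and method names may be slightly off:

    import org.apache.spark.mllib.linalg.Vectors
    import com.cloudera.sparkts.models.ARIMA

    object ArimaSketch {
      def main(args: Array[String]): Unit = {
        // A toy univariate series; in practice this would come from real data.
        val ts = Vectors.dense(Array(12.0, 14.5, 13.8, 15.2, 16.1, 15.9, 17.3, 18.0))

        // Fit ARIMA(1, 0, 1) and forecast the next 5 points.
        val model = ARIMA.fitModel(1, 0, 1, ts)
        println(model.forecast(ts, 5))
      }
    }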

On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar 
wrote:

> Hello everybody, this is my first mail to the list, and I would like to
> introduce myself first :)
> My name is Mohamed Baddar. I work as a Big Data and Analytics Software
> Engineer at BADRIT (http://badrit.com/), a software startup with a focus on
> Big Data. I have also been working for 6+ years at IBM R&D Egypt, in the
> HPC, Big Data, and Analytics areas.
>
> I just have a question: I can't find a supported Apache Spark library for
> forecasting using ARIMA, ETS, Bayesian models, or any other method. Are
> there any plans for such a development? I can't find any issue talking
> about it. Is anyone interested in having/developing a related module? I
> find it a critical feature to be added to Spark.
>
> Thanks
>


Re: Raise Java dependency from 6 to 7

2014-10-19 Thread Corey Nolet
A concrete plan and a definite version upon which the upgrade would be
applied sound like they would benefit the community. If you plan far enough
out (as Hadoop has done) and give the community enough notice, I can't see
it being a problem, as they would have ample time to upgrade.



On Sat, Oct 18, 2014 at 9:20 PM, Marcelo Vanzin  wrote:

> Hadoop, for better or worse, depends on an ancient version of Jetty
> (6), which is even in a different package. So Spark (or anyone trying
> to use a newer Jetty) is lucky on that front...
>
> IIRC Hadoop is planning to move to Java 7-only starting with 2.7. Java
> 7 is also supposed to be EOL some time next year, so a plan to move to
> Java 7 and, eventually, Java 8 would be nice.
>
> On Sat, Oct 18, 2014 at 5:44 PM, Matei Zaharia 
> wrote:
> > I'd also wait a bit until these are gone. Jetty is unfortunately a much
> hairier topic by the way, because the Hadoop libraries also depend on
> Jetty. I think it will be hard to update. However, a patch that shades
> Jetty might be nice to have, if that doesn't require shading a lot of other
> stuff.
> >
> > Matei
> >
> >> On Oct 18, 2014, at 4:37 PM, Koert Kuipers  wrote:
> >>
> >> my experience is that there are still a lot of java 6 clusters out
> there.
> >> also distros that bundle spark still support java 6
> >> On Oct 17, 2014 8:01 PM, "Andrew Ash"  wrote:
> >>
> >>> Hi Spark devs,
> >>>
> >>> I've heard a few times that keeping support for Java 6 is a priority
> >>> for Apache Spark. Given that Java 6 has been publicly EOL'd since Feb
> >>> 2013 and the last public update was Apr 2013, why are we still
> >>> maintaining support for 6? The only people using it now must be paying
> >>> for the extended support to continue receiving security fixes.
> >>>
> >>> Bumping the lower bound of Java versions up to Java 7 would allow us to
> >>> upgrade from Jetty 8 to 9, which is currently a conflict with the
> >>> Dropwizard framework and a personal pain point.
> >>>
> >>> Java 6 vs 7 for Spark links:
> >>> Try-with-resources for SparkContext et al
> >>> Upgrade to Jetty 9
> >>> Warn when not compiling with Java 6
> >>>
> >>>
> >>> Who are the people out there that still need Java 6 support?
> >>>
> >>> Thanks!
> >>> Andrew
> >>>
> >
> >
> >
>
>
>
> --
> Marcelo
>
>
>


Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
+1 (non-binding) [for original process proposal]

Greg, the first time I've seen the word "ownership" on this thread is in
your message. The first time the word "lead" has appeared in this thread is
in your message as well. I don't think that was the intent. The PMC and
Committers have a responsibility to the community to make sure that their
patches are being reviewed and committed. I don't see anywhere in Apache's
recommended bylaws that says responsibility for specific areas cannot be
formally taken on by different members of the PMC. What's been proposed
looks, to me, to be an empirical process, and it looks like it has pretty
much a consensus from the side able to give binding votes. I don't think
this model establishes any form of ownership over anything. I also don't
see where the process proposal mentions that nobody other than the persons
responsible for a module can review or commit code.

In fact, I'll go as far as to say that since Apache is a meritocracy, the
people who have been aligned to the responsibilities probably were aligned
based on some sort of merit, correct? Perhaps we could dig in and find out
for sure... I'm still getting familiar with the Spark community myself.



On Thu, Nov 6, 2014 at 7:28 PM, Patrick Wendell  wrote:

> In fact, if you look at the subversion commiter list, the majority of
> people here have commit access only for particular areas of the
> project:
>
> http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS
>
> On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell 
> wrote:
> > Hey Greg,
> >
> > Regarding subversion - I think the reference is to partial vs full
> > committers here:
> > https://subversion.apache.org/docs/community-guide/roles.html
> >
> > - Patrick
> >
> > On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein  wrote:
> >> -1 (non-binding)
> >>
> >> This is an idea that runs COMPLETELY counter to the Apache Way, and is
> >> to be severely frowned upon. This creates *unequal* ownership of the
> >> codebase.
> >>
> >> Each Member of the PMC should have *equal* rights to all areas of the
> >> codebase until their purview. It should not be subjected to others'
> >> "ownership" except throught the standard mechanisms of reviews and
> >> if/when absolutely necessary, to vetos.
> >>
> >> Apache does not want "leads", "benevolent dictators" or "assigned
> >> maintainers", no matter how you may dress it up with multiple
> >> maintainers per component. The fact is that this creates an unequal
> >> level of ownership and responsibility. The Board has shut down
> >> projects that attempted or allowed for "Leads". Just a few months ago,
> >> there was a problem with somebody calling themself a "Lead".
> >>
> >> I don't know why you suggest that Apache Subversion does this. We
> >> absolutely do not. Never have. Never will. The Subversion codebase is
> >> owned by all of us, and we all care for every line of it. Some people
> >> know more than others, of course. But any one of us, can change any
> >> part, without being subjected to a "maintainer". Of course, we ask
> >> people with more knowledge of the component when we feel
> >> uncomfortable, but we also know when it is safe or not to make a
> >> specific change. And *always*, our fellow committers can review our
> >> work and let us know when we've done something wrong.
> >>
> >> Equal ownership reduces fiefdoms, enhances a feeling of community and
> >> project ownership, and creates a more open and inviting project.
> >>
> >> So again: -1 on this entire concept. Not good, to be polite.
> >>
> >> Regards,
> >> Greg Stein
> >> Director, Vice Chairman
> >> Apache Software Foundation
> >>
> >> On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
> >>> Hi all,
> >>>
> >>> I wanted to share a discussion we've been having on the PMC list, as
> well as call for an official vote on it on a public list. Basically, as the
> Spark project scales up, we need to define a model to make sure there is
> still great oversight of key components (in particular internal
> architecture and public APIs), and to this end I've proposed implementing a
> maintainer model for some of these components, similar to other large
> projects.
> >>>
> >>> As background on this, Spark has grown a lot since joining Apache.
> We've had over 80 contributors/month for the past 3 months, which I believe
> makes us the most active project in contributors/month at Apache, as well
> as over 500 patches/month. The codebase has also grown significantly, with
> new libraries for SQL, ML, graphs and more.
> >>>
> >>> In this kind of large project, one common way to scale development is
> to assign "maintainers" to oversee key components, where each patch to that
> component needs to get sign-off from at least one of its maintainers. Most
> existing large projects do this -- at Apache, some large ones with this
> model are CloudStack (the second-most active project overall), Subversion,
> and Kafka, and other examples include Linux and Py

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
The PMC [1] is responsible for oversight and does not designate partial or
full committers. There are projects where all committers become PMC members
and others where the PMC is reserved for committers with the most merit (and
willingness to take on the responsibility of project oversight, releases,
etc.). The community maintains the codebase through its committers, who
mentor, roll in patches, and spread the project throughout other communities.

Adding someone's name to a list as a "maintainer" is not a barrier. With a
community as large as Spark's, and myself not being a committer on this
project, I see it as a welcome opportunity to find a mentor in the areas in
which I'm interested in contributing. We'd expect the list of names to grow
as more volunteers gain more interest, correct? To me, that seems quite
contrary to a "barrier".

[1] http://www.apache.org/dev/pmc.html


On Thu, Nov 6, 2014 at 7:49 PM, Matei Zaharia 
wrote:

> So I don't understand, Greg, are the partial committers committers, or are
> they not? Spark also has a PMC, but our PMC currently consists of all
> committers (we decided not to have a differentiation when we left the
> incubator). I see the Subversion partial committers listed as "committers"
> on https://people.apache.org/committers-by-project.html#subversion, so I
> assume they are committers. As far as I can see, CloudStack is similar.
>
> Matei
>
> > On Nov 6, 2014, at 4:43 PM, Greg Stein  wrote:
> >
> > Partial committers are people invited to work on a particular area, and
> they do not require sign-off to work on that area. They can get a sign-off
> and commit outside that area. That approach doesn't compare to this
> proposal.
> >
> > Full committers are PMC members. As each PMC member is responsible for
> *every* line of code, then every PMC member should have complete rights to
> every line of code. Creating disparity flies in the face of a PMC member's
> responsibility. If I am a Spark PMC member, then I have responsibility for
> GraphX code, whether my name is Ankur, Joey, Reynold, or Greg. And
> interposing a barrier inhibits my responsibility to ensure GraphX is
> designed, maintained, and delivered to the Public.
> >
> > Cheers,
> > -g
> >
> > (and yes, I'm aware of COMMITTERS; I've been changing that file for the
> past 12 years :-) )
> >
> > On Thu, Nov 6, 2014 at 6:28 PM, Patrick Wendell  > wrote:
> > In fact, if you look at the subversion commiter list, the majority of
> > people here have commit access only for particular areas of the
> > project:
> >
> > http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS <
> http://svn.apache.org/repos/asf/subversion/trunk/COMMITTERS>
> >
> > On Thu, Nov 6, 2014 at 4:26 PM, Patrick Wendell  > wrote:
> > > Hey Greg,
> > >
> > > Regarding subversion - I think the reference is to partial vs full
> > > committers here:
> > > https://subversion.apache.org/docs/community-guide/roles.html <
> https://subversion.apache.org/docs/community-guide/roles.html>
> > >
> > > - Patrick
> > >
> > > On Thu, Nov 6, 2014 at 4:18 PM, Greg Stein  gst...@gmail.com>> wrote:
> > >> -1 (non-binding)
> > >>
> > >> This is an idea that runs COMPLETELY counter to the Apache Way, and is
> > >> to be severely frowned upon. This creates *unequal* ownership of the
> > >> codebase.
> > >>
> > >> Each Member of the PMC should have *equal* rights to all areas of the
> > >> codebase until their purview. It should not be subjected to others'
> > >> "ownership" except throught the standard mechanisms of reviews and
> > >> if/when absolutely necessary, to vetos.
> > >>
> > >> Apache does not want "leads", "benevolent dictators" or "assigned
> > >> maintainers", no matter how you may dress it up with multiple
> > >> maintainers per component. The fact is that this creates an unequal
> > >> level of ownership and responsibility. The Board has shut down
> > >> projects that attempted or allowed for "Leads". Just a few months ago,
> > >> there was a problem with somebody calling themself a "Lead".
> > >>
> > >> I don't know why you suggest that Apache Subversion does this. We
> > >> absolutely do not. Never have. Never will. The Subversion codebase is
> > >> owned by all of us, and we all care for every line of it. Some people
> > >> know more than others, of course. But any one of us, can change any
> > >> part, without being subjected to a "maintainer". Of course, we ask
> > >> people with more knowledge of the component when we feel
> > >> uncomfortable, but we also know when it is safe or not to make a
> > >> specific change. And *always*, our fellow committers can review our
> > >> work and let us know when we've done something wrong.
> > >>
> > >> Equal ownership reduces fiefdoms, enhances a feeling of community and
> > >> project ownership, and creates a more open and inviting project.
> > >>
> > >> So again: -1 on this entire concept. Not good, to be polite.
> > >>
> > >> Regards,
> > >> Greg Stein
> > >> Dire

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Corey Nolet
I'm actually going to change my non-binding to +0 for the proposal as-is.

I overlooked some parts of the original proposal that, on reading them
again, do not sit well with me. "One of the maintainers needs to sign off on
each patch to the component", as Greg has pointed out, does seem to imply
that there are committers with more power than others with regard to
specific components- which does imply ownership.

My thinking would be to rework the proposal in some way so as to take the
accent off ownership. I would maybe focus on things such as:

1) Other committers and contributors being forced to consult with
maintainers of modules before patches can get rolled in.
2) Maintainers being assigned specifically from the PMC.
3) Oversight having more of an accent on keeping the community happy in a
specific area of interest rather than being a consultant for the design of a
specific piece.

On Thu, Nov 6, 2014 at 8:46 PM, Arun C Murthy  wrote:

> With my ASF Member hat on, I fully agree with Greg.
>
> As he points out, this is an anti-pattern in the ASF and is severely
> frowned upon.
>
> We, in Hadoop, had a similar trajectory where we were politely told to
> go away from having sub-project committers (HDFS, MapReduce etc.) to a
> common list of committers. There were some concerns initially, but we have
> successfully managed to work together and build a more healthy community as
> a result of following the advice on the ASF Way.
>
> I do have sympathy for good oversight etc. as the project grows and
> attracts many contributors - it's essentially the need to have smaller,
> well-knit developer communities. One way to achieve that would be to have
> separate TLPs  (e.g. Spark, MLLIB, GraphX) with separate committer lists
> for each representing the appropriate community. Hadoop went a similar
> route where we had Pig, Hive, HBase etc. as sub-projects initially and then
> split them into TLPs with more focussed communities to the benefit of
> everyone. Maybe you guys want to try this too?
>
> 
>
> Few more observations:
> # In general, *discussions* on project directions (such as new concept of
> *maintainers*) should happen first on the public lists *before* voting, not
> in the private PMC list.
> # If you choose to go this route in spite of this advice, it seems to me
> Spark would be better off having more maintainers per component (at least 4-5),
> probably with a lot more diversity in terms of affiliations. Not sure if
> that is a concern - do you have good diversity in the proposed list? This
> will ensure that there are no concerns about a dominant employer
> controlling a project.
>
> 
>
> Hope this helps - we've gone through a similar journey, got through similar
> issues, and fully embraced the Apache Way (™), as Greg points out, to our
> benefit.
>
> thanks,
> Arun
>
>
> On Nov 6, 2014, at 4:18 PM, Greg Stein  wrote:
>
> > -1 (non-binding)
> >
> > This is an idea that runs COMPLETELY counter to the Apache Way, and is
> > to be severely frowned upon. This creates *unequal* ownership of the
> > codebase.
> >
> > Each Member of the PMC should have *equal* rights to all areas of the
> > codebase until their purview. It should not be subjected to others'
> > "ownership" except throught the standard mechanisms of reviews and
> > if/when absolutely necessary, to vetos.
> >
> > Apache does not want "leads", "benevolent dictators" or "assigned
> > maintainers", no matter how you may dress it up with multiple
> > maintainers per component. The fact is that this creates an unequal
> > level of ownership and responsibility. The Board has shut down
> > projects that attempted or allowed for "Leads". Just a few months ago,
> > there was a problem with somebody calling themself a "Lead".
> >
> > I don't know why you suggest that Apache Subversion does this. We
> > absolutely do not. Never have. Never will. The Subversion codebase is
> > owned by all of us, and we all care for every line of it. Some people
> > know more than others, of course. But any one of us, can change any
> > part, without being subjected to a "maintainer". Of course, we ask
> > people with more knowledge of the component when we feel
> > uncomfortable, but we also know when it is safe or not to make a
> > specific change. And *always*, our fellow committers can review our
> > work and let us know when we've done something wrong.
> >
> > Equal ownership reduces fiefdoms, enhances a feeling of community and
> > project ownership, and creates a more open and inviting project.
> >
> > So again: -1 on this entire concept. Not good, to be polite.
> >
> > Regards,
> > Greg Stein
> > Director, Vice Chairman
> > Apache Software Foundation
> >
> > On Wed, Nov 05, 2014 at 05:31:58PM -0800, Matei Zaharia wrote:
> >> Hi all,
> >>
> >> I wanted to share a discussion we've been having on the PMC list, as
> well as call for an official vote on it on a public list. Basically, as the
> Spark project scales up, we need to define a model to make sure there is
> still great oversi

Spark & Hadoop 2.5.1

2014-11-14 Thread Corey Nolet
I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
the current stable Hadoop 2.x, would it make sense for us to update the
poms?


Re: Spark & Hadoop 2.5.1

2014-11-14 Thread Corey Nolet
In the past, I've built it by providing -Dhadoop.version=2.5.1 exactly like
you've mentioned. What prompted me to write this email was that I did not
see any documentation that told me Hadoop 2.5.1 was officially supported by
Spark (i.e. community has been using it, any bugs are being fixed, etc...).
It builds, tests pass, etc... but there could be other implications that I
have not run into based on my own use of the framework.

If we are saying that the standard procedure is to build with the
hadoop-2.4 profile and override the -Dhadoop.version property, should we
at least document that in the build instructions [1]?

[1] http://spark.apache.org/docs/latest/building-with-maven.html
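
For the record, the invocation I've been using is simply the hadoop-2.4
profile with the version property overridden, along the lines of:

    mvn -Phadoop-2.4 -Dhadoop.version=2.5.1 -DskipTests clean package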

On Fri, Nov 14, 2014 at 10:46 AM, Sean Owen  wrote:

> I don't think it's necessary. You're looking at the hadoop-2.4
> profile, which works with anything >= 2.4. AFAIK there is no further
> specialization needed beyond that. The profile sets hadoop.version to
> 2.4.0 by default, but this can be overridden.
>
> On Fri, Nov 14, 2014 at 3:43 PM, Corey Nolet  wrote:
> > I noticed Spark 1.2.0-SNAPSHOT still has 2.4.x in the pom. Since 2.5.x is
> > the current stable Hadoop 2.x, would it make sense for us to update the
> > poms?
>