Re: [VOTE] Decommissioning SPIP

2020-07-01 Thread Stephen Boesch
+1 Thx for seeing this through

On Wed, 1 Jul 2020 at 20:03, Imran Rashid 
wrote:

> +1
>
> I think this is going to be a really important feature for Spark and I'm
> glad to see Holden focusing on it.
>
> On Wed, Jul 1, 2020 at 8:38 PM Mridul Muralidharan 
> wrote:
>
>> +1
>>
>> Thanks,
>> Mridul
>>
>> On Wed, Jul 1, 2020 at 6:36 PM Hyukjin Kwon  wrote:
>>
>>> +1
>>>
>>> On Thu, Jul 2, 2020 at 10:08 AM, Marcelo Vanzin wrote:
>>>
 I reviewed the docs and PRs from well before an SPIP was explicitly
 asked for, so I'm comfortable giving a +1 even if I haven't fully read
 the new document.

 On Wed, Jul 1, 2020 at 6:05 PM Holden Karau 
 wrote:
 >
 > Hi Spark Devs,
 >
 > I think discussion has settled on the SPIP doc at
 https://docs.google.com/document/d/1EOei24ZpVvR7_w0BwBjOnrWRy4k-qTdIlx60FsHZSHA/edit?usp=sharing
 , design doc at
 https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit,
 or JIRA https://issues.apache.org/jira/browse/SPARK-20624, and I've
 received a request to put the SPIP up for a VOTE quickly. The discussion
 thread on the mailing list is at
 http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Graceful-Decommissioning-td29650.html
 .
 >
 > Normally this vote would be open for 72 hours, however since it's a
 long weekend in the US where many of the PMC members are, this vote will
 not close before July 6th at noon pacific time.
 >
 > The SPIP procedures are documented at:
 https://spark.apache.org/improvement-proposals.html. The ASF's voting
 guide is at https://www.apache.org/foundation/voting.html.
 >
 > Please vote before July 6th at noon:
 >
 > [ ] +1: Accept the proposal as an official SPIP
 > [ ] +0
 > [ ] -1: I don't think this is a good idea because ...
 >
 > I will start the voting off with a +1 from myself.
 >
 > Cheers,
 >
 > Holden



 --
 Marcelo Vanzin
 van...@gmail.com
 "Life's too short to drink cheap beer"

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re: Initial Decom PR for Spark 3?

2020-06-22 Thread Stephen Boesch
I guess I missed that "community decision" where the existing design
document that had been reviewed was put aside and a new SPIP  document was
required.

On Sun, 21 Jun 2020 at 19:05, Hyukjin Kwon  wrote:

> Yeah, I believe the community decided to do a SPIP for such significant
> changes. It would be best if we stick to the standard approaches.
>
> On Sun, Jun 21, 2020 at 8:52 AM, Holden Karau wrote:
>
>> I believe so, however since Hyukjin is a committer and has asked for an
>> SPIP I'll be making an SPIP for this next week. I hope to send out the
>> draft for comment by the end of Spark summit. I'll be using the same design
>> document for the design component, so if anyone has input on the design
>> document feel free to start leaving comments there now.
>>
>> On Sat, Jun 20, 2020 at 4:23 PM Stephen Boesch  wrote:
>>
>>> Hi, given there is a design doc (contrary to that comment) - is this going
>>> to move forward?
>>>
>>> On Thu, 18 Jun 2020 at 18:05, Hyukjin Kwon  wrote:
>>>
>>>> Looks like it needed to come with an SPIP and a proper design doc to discuss.
>>>>
>>>> On Sun, Feb 9, 2020 at 1:23 AM, Erik Erlandson wrote:
>>>>
>>>>> I'd be willing to pull this in, unless others have concerns post
>>>>> branch-cut.
>>>>>
>>>>> On Tue, Feb 4, 2020 at 2:51 PM Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> Hi Y’all,
>>>>>>
>>>>>> I’ve got a K8s graceful decom PR (
>>>>>> https://github.com/apache/spark/pull/26440
>>>>>>  ) I’d love to try and get in for Spark 3, but I don’t want to push
>>>>>> on it if folks don’t think it’s worth it. I’ve been working on it since
>>>>>> 2017 and it was really close in November but then I had the crash and had
>>>>>> to step back for awhile.
>>>>>>
>>>>>> It’s effectiveness is behind a feature flag and it’s been outstanding
>>>>>> for awhile so those points are in its favour. It does however change 
>>>>>> things
>>>>>> in core which is not great.
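For readers coming to this thread later: the feature flag Holden mentions eventually surfaced as a family of decommission configs in Spark 3.x. A rough sketch of turning it on (the exact keys have shifted across releases, so treat these names as illustrative and check the configuration docs for your version):

    // Illustrative only - enabling graceful decommissioning in later Spark 3.x builds.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.decommission.enabled", "true")                    // master switch
      .set("spark.storage.decommission.enabled", "true")            // migrate blocks off decommissioning executors
      .set("spark.storage.decommission.rddBlocks.enabled", "true")
      .set("spark.storage.decommission.shuffleBlocks.enabled", "true")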
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Initial Decom PR for Spark 3?

2020-06-20 Thread Stephen Boesch
Hi, given there is a design doc (contrary to that comment) - is this going to
move forward?

On Thu, 18 Jun 2020 at 18:05, Hyukjin Kwon  wrote:

> Looks like it needed to come with an SPIP and a proper design doc to discuss.
>
> On Sun, Feb 9, 2020 at 1:23 AM, Erik Erlandson wrote:
>
>> I'd be willing to pull this in, unless others have concerns post
>> branch-cut.
>>
>> On Tue, Feb 4, 2020 at 2:51 PM Holden Karau  wrote:
>>
>>> Hi Y’all,
>>>
>>> I’ve got a K8s graceful decom PR (
>>> https://github.com/apache/spark/pull/26440
>>>  ) I’d love to try and get in for Spark 3, but I don’t want to push on
>>> it if folks don’t think it’s worth it. I’ve been working on it since 2017
>>> and it was really close in November but then I had the crash and had to
>>> step back for awhile.
>>>
>>> It’s effectiveness is behind a feature flag and it’s been outstanding
>>> for awhile so those points are in its favour. It does however change things
>>> in core which is not great.
>>>
>>> Cheers,
>>>
>>> Holden
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: Initial Decom PR for Spark 3?

2020-06-18 Thread Stephen Boesch
Second paragraph of the PR lists the design doc.

> There is a design document at
https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing

On Thu, 18 Jun 2020 at 18:05, Hyukjin Kwon  wrote:

> Looks like it needed to come with an SPIP and a proper design doc to discuss.
>
> On Sun, Feb 9, 2020 at 1:23 AM, Erik Erlandson wrote:
>
>> I'd be willing to pull this in, unless others have concerns post
>> branch-cut.
>>
>> On Tue, Feb 4, 2020 at 2:51 PM Holden Karau  wrote:
>>
>>> Hi Y’all,
>>>
>>> I’ve got a K8s graceful decom PR (
>>> https://github.com/apache/spark/pull/26440
>>>  ) I’d love to try and get in for Spark 3, but I don’t want to push on
>>> it if folks don’t think it’s worth it. I’ve been working on it since 2017
>>> and it was really close in November but then I had the crash and had to
>>> step back for awhile.
>>>
>>> It’s effectiveness is behind a feature flag and it’s been outstanding
>>> for awhile so those points are in its favour. It does however change things
>>> in core which is not great.
>>>
>>> Cheers,
>>>
>>> Holden
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Ok. Can't think of why that would happen.

On Tue, Sep 10, 2019 at 8:26 PM Dhrubajyoti Hati <
dhruba.w...@gmail.com> wrote:

> As mentioned in the very first mail:
> * they are submitted to the same cluster
> * they are submitted from the same machine and by the same user
> * each of them has 128 executors and 2 cores per executor with 8 gigs of
> memory each, and both of them are getting that while running
>
> To clarify more, let me quote what I mentioned above. *This data is taken
> from the Spark UI when the jobs are almost finished in both.*
> "What I found is that the quantile value for the median for the one run with
> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins." This
> means the per-task time taken is much higher in the spark-submit script than
> in the jupyter script. This is where I am really puzzled, because they are the
> exact same code. Why does running them two different ways vary so much in
> execution time?
>
>
>
>
> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch  wrote:
>
>> Sounds like you have done your homework to properly compare. I'm
>> guessing the answer to the following is yes, but in any case: are they
>> both running against the same Spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
>>
>> On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <
>> dhruba.w...@gmail.com> wrote:
>>
>>> No, I checked for that, hence I wrote "brand new" jupyter notebook. Also,
>>> the times taken by the two are 30 mins and ~3 hrs, as I am reading 500 gigs
>>> of compressed, base64-encoded text data from a Hive table and decompressing
>>> and decoding it in one of the UDFs. Also, the time compared is from the Spark
>>> UI, not how long the job actually takes after submission. It's just the
>>> running time I am comparing/mentioning.
>>>
>>> As mentioned earlier, all the Spark conf params match in the two scripts,
>>> and that's why I am puzzled about what's going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>>> already is connected to a running spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> what the code looks like mostly:
>>>>>
>>>>> def fn1(ip):
>>>>>some code...
>>>>> ...
>>>>>
>>>>> def fn2(row):
>>>>> ...
>>>>> some operations
>>>>> ...
>>>>> return row1
>>>>>
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> cdf = spark.read.table("") # hive table is of size > 500 Gigs with
>>>>> ~4500 partitions
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>> .drop("colz") \
>>>>> .withColumnRenamed("colz", "coly")
>>>>>
>>>>> edf = ddf \
>>>>> .filter(ddf.colp == 'some_value') \
>>>>> .rdd.map(lambda row: fn2(row)) \
>>>>> .toDF()
>>>>>
>>>>> print edf.count() # simple way for the performance test in both
>>>>> platforms
>>>>>
>>>>> Now when I run the same code in a brand new jupyter notebook it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and  compared from both the platforms and they
>>>>> are exact same. I even tried to run this script in a single cell of 
>>>>> jupyter
>>>>> notebook and still have the same performance. I need to understand if I am
>>>>> missing something in the spark-submit which is causing the issue.  I tried
>>>>> to minimise the script to reproduce the same error without much code.
>>>>>
>>>>> Both are run in client mode on a yarn based spark cluster. The
>>>>> machines from which both are executed are also the same and from same 
>>>>> user.
>>>>>
>>>>> What I found is that the quantile value for the median for the one run with
>>>>> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am
>>>>> not able to figure out why this is happening.
>>>>>
>>>>> Any one faced this kind of issue before or know how to resolve this?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy  *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare. I'm
guessing the answer to the following is yes, but in any case: are they
both running against the same Spark cluster with the same configuration
parameters, especially executor memory and number of workers?
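For what it's worth, a quick way to answer that from inside each run is to print the application id and the effective configuration and diff the two outputs. A minimal sketch in Scala (the same calls exist in PySpark; `spark` is the active SparkSession, predefined in spark-shell):

    // Run this in both the notebook and the spark-submit job, then diff the output.
    println(spark.sparkContext.applicationId)     // distinct YARN application per run?
    println(spark.sparkContext.master)            // same master / deploy mode?
    spark.sparkContext.getConf.getAll.sorted.foreach { case (k, v) =>
      println(s"$k=$v")                           // effective conf, incl. executor memory and cores
    }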

On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <
dhruba.w...@gmail.com> wrote:

> No, I checked for that, hence I wrote "brand new" jupyter notebook. Also,
> the times taken by the two are 30 mins and ~3 hrs, as I am reading 500 gigs
> of compressed, base64-encoded text data from a Hive table and decompressing
> and decoding it in one of the UDFs. Also, the time compared is from the Spark
> UI, not how long the job actually takes after submission. It's just the
> running time I am comparing/mentioning.
>
> As mentioned earlier, all the Spark conf params match in the two scripts,
> and that's why I am puzzled about what's going on.
>
> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
> wrote:
>
>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>> already is connected to a running spark context, while spark-submit needs
>> to get a new spot in the (YARN?) queue.
>>
>> I would check the cluster job IDs for both to ensure you're getting new
>> cluster tasks for each.
>>
>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>> wrote:
>>
>>> Hi,
>>>
>>> I am facing a weird behaviour while running a python script. Here is
>>> what the code looks like mostly:
>>>
>>> def fn1(ip):
>>>some code...
>>> ...
>>>
>>> def fn2(row):
>>> ...
>>> some operations
>>> ...
>>> return row1
>>>
>>>
>>> udf_fn1 = udf(fn1)
>>> cdf = spark.read.table("") # hive table is of size > 500 Gigs with
>>> ~4500 partitions
>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>> .drop("colz") \
>>> .withColumnRenamed("colz", "coly")
>>>
>>> edf = ddf \
>>> .filter(ddf.colp == 'some_value') \
>>> .rdd.map(lambda row: fn2(row)) \
>>> .toDF()
>>>
>>> print edf.count() # simple way for the performance test in both
>>> platforms
>>>
>>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>>> faster than when I run this python script using spark-submit. The
>>> configurations are printed and  compared from both the platforms and they
>>> are exact same. I even tried to run this script in a single cell of jupyter
>>> notebook and still have the same performance. I need to understand if I am
>>> missing something in the spark-submit which is causing the issue.  I tried
>>> to minimise the script to reproduce the same error without much code.
>>>
>>> Both are run in client mode on a yarn based spark cluster. The machines
>>> from which both are executed are also the same and from same user.
>>>
>>> What I found is that the quantile value for the median for the one run with
>>> jupyter was 1.3 mins and the one run with spark-submit was ~8.5 mins. I am not
>>> able to figure out why this is happening.
>>>
>>> Any one faced this kind of issue before or know how to resolve this?
>>>
>>> *Regards,*
>>> *Dhrub*
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>


Re: [MLlib] PCA Aggregator

2018-10-19 Thread Stephen Boesch
Erik - is there a current venue for approved/recommended third-party
additions? The spark-packages site has been stale for years, it seems.

On Fri, Oct 19, 2018 at 7:06 AM Erik Erlandson <
eerla...@redhat.com> wrote:

> Hi Matt!
>
> There are a couple ways to do this. If you want to submit it for inclusion
> in Spark, you should start by filing a JIRA for it, and then a pull
> request.   Another possibility is to publish it as your own 3rd party
> library, which I have done for aggregators before.
>
>
> On Wed, Oct 17, 2018 at 4:54 PM Matt Saunders  wrote:
>
>> I built an Aggregator that computes PCA on grouped datasets. I wanted to
>> use the PCA functions provided by MLlib, but they only work on a full
>> dataset, and I needed to do it on a grouped dataset (like a
>> RelationalGroupedDataset).
>>
>> So I built a little Aggregator that can do that, here’s an example of how
>> it’s called:
>>
>> val pcaAggregation = new PCAAggregator(vectorColumnName).toColumn
>>
>> // For each grouping, compute a PCA matrix/vector
>> val pcaModels = inputData
>>   .groupBy(keys:_*)
>>   .agg(pcaAggregation.as(pcaOutput))
>>
>> I used the same algorithms under the hood as
>> RowMatrix.computePrincipalComponentsAndExplainedVariance, though this works
>> directly on Datasets without converting to RDD first.
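For readers unfamiliar with the API being described here: org.apache.spark.sql.expressions.Aggregator is the typed aggregation contract behind .toColumn. The following is a toy stand-in, not Matt's PCAAggregator - it just averages a value per group - but it shows the zero/reduce/merge/finish/encoder shape a PCA version would follow (accumulating a Gramian/covariance in the buffer and doing the eigendecomposition in finish()):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    case class Point(group: String, value: Double)
    case class SumCount(sum: Double, count: Long)   // per-group buffer

    object MeanAgg extends Aggregator[Point, SumCount, Double] {
      def zero: SumCount = SumCount(0.0, 0L)                                        // empty buffer
      def reduce(b: SumCount, p: Point): SumCount = SumCount(b.sum + p.value, b.count + 1)
      def merge(a: SumCount, b: SumCount): SumCount = SumCount(a.sum + b.sum, a.count + b.count)
      def finish(b: SumCount): Double = if (b.count == 0) 0.0 else b.sum / b.count  // per-group result
      def bufferEncoder: Encoder[SumCount] = Encoders.product[SumCount]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Typed usage against a Dataset[Point]:
    //   import spark.implicits._
    //   val ds = spark.createDataset(Seq(Point("a", 1.0), Point("a", 3.0), Point("b", 2.0)))
    //   ds.groupByKey(_.group).agg(MeanAgg.toColumn.name("mean")).show()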
>>
>> I’ve seen others who wanted this ability (for example on Stack Overflow)
>> so I’d like to contribute it if it would be a benefit to the larger
>> community.
>>
>> So.. is this something worth contributing to MLlib? And if so, what are
>> the next steps to start the process?
>>
>> thanks!
>>
>


Re: Spark.ml roadmap 2.3.0 and beyond

2018-03-20 Thread Stephen Boesch
awesome thanks Joseph

2018-03-20 14:51 GMT-07:00 Joseph Bradley <jos...@databricks.com>:

> The promised roadmap JIRA: https://issues.apache.
> org/jira/browse/SPARK-23758
>
> Note it doesn't have much explicitly listed yet, but committers can add
> items as they agree to shepherd them.  (Committers, make sure to check what
> you're currently listed as shepherding!)  The links for searching can be
> useful too.
>
> On Thu, Dec 7, 2017 at 3:55 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> Thanks Joseph.  We can wait for post 2.3.0.
>>
>> 2017-12-07 15:36 GMT-08:00 Joseph Bradley <jos...@databricks.com>:
>>
>>> Hi Stephen,
>>>
>>> I used to post those roadmap JIRAs to share instructions for
>>> contributing to MLlib and to try to coordinate amongst committers.  My
>>> feeling was that the coordination aspect was of mixed success, so I did not
>>> post one for 2.3.  I'm glad you pinged about this; if those were useful,
>>> then I can plan on posting one for the release after 2.3.  As far as
>>> identifying committers' plans, the best option right now is to look for
>>> Shepherds in JIRA as well as the few mailing list threads about directions.
>>>
>>> For myself, I'm mainly focusing on fixing some issues with persistence
>>> for custom algorithms in PySpark (done), adding the image schema (done),
>>> and using ML Pipelines in Structured Streaming (WIP).
>>>
>>> Joseph
>>>
>>> On Wed, Nov 29, 2017 at 6:52 AM, Stephen Boesch <java...@gmail.com>
>>> wrote:
>>>
>>>> There are several  JIRA's and/or PR's that contain logic the Data
>>>> Science teams that I work with use in their local models. We are trying to
>>>> determine if/when these features may gain traction again.  In at least one
>>>> case all of the work were done but the shepherd said that getting it
>>>> committed were of lower priority than other tasks - one specifically
>>>> mentioned was the mllib/ml parity that has been ongoing for nearly three
>>>> years.
>>>>
>>>> In order to prioritize work that the ML platform would do it would be
>>>> helpful to know at least which if any of those tasks were going to be moved
>>>> ahead by the community: since we could then focus on other ones instead of
>>>> duplicating the effort.
>>>>
>>>> In addition there are some engineering code jam sessions that happen
>>>> periodically: knowing which features are actively on the roadmap would 
>>>> *certainly
>>>> *influence our selection of work.  The roadmaps from 2.2.0 and earlier
>>>> were a very good starting point to understand not just the specific work in
>>>> progress - but also the current mindset/thinking of the committers in terms
>>>> of general priorities.
>>>>
>>>> So if the same format of document were not available - then what
>>>> content *is *that gives a picture of where spark.ml were headed?
>>>>
>>>> 2017-11-29 6:39 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>>>
>>>>> Any further information/ thoughts?
>>>>>
>>>>>
>>>>>
>>>>> 2017-11-22 15:07 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>>>>
>>>>>> The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available:
>>>>>>
>>>>>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813
>>>>>>
>>>>>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581
>>>>>> ..
>>>>>>
>>>>>> It seems those roadmaps were not available per se for 2.3.0 and
>>>>>> later? Is there a different mechanism for that info?
>>>>>>
>>>>>> stephenb
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com]
>>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com]
>


Re: GenerateExec, CodegenSupport and supportCodegen flag off?!

2017-12-10 Thread Stephen Boesch
A relevant observation: there was a closed/executed JIRA last year to
remove the option to disable the codegen flag (and the unsafe flag as well):
https://issues.apache.org/jira/browse/SPARK-11644
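For anyone reproducing Jacek's observation below: the knob that does remain in 2.x is the whole-stage codegen conf, which is handy for comparing how operators such as GenerateExec get planned. A quick sketch (the exact plan shape depends on the Spark version):

    // Toggle whole-stage codegen and compare the physical plans.
    spark.conf.set("spark.sql.codegen.wholeStage", "false")
    spark.range(10).selectExpr("explode(array(id, id * 2)) as v").explain()  // no WholeStageCodegen wrappers at all

    spark.conf.set("spark.sql.codegen.wholeStage", "true")
    spark.range(10).selectExpr("explode(array(id, id * 2)) as v").explain()  // shows where codegen stage boundaries fall around Generate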

2017-12-10 13:16 GMT-08:00 Jacek Laskowski :

> Hi,
>
> I'm wondering why a physical operator like GenerateExec would
> extend CodegenSupport [1], but had the supportCodegen flag turned off?
>
> What's the meaning of such a combination -- be a CodegenSupport with
> supportCodegen off?
>
> [1] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala#L58-L64
>
> [2] https://github.com/apache/spark/blob/master/sql/core/
> src/main/scala/org/apache/spark/sql/execution/GenerateExec.scala#L125
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> Spark Structured Streaming https://bit.ly/spark-structured-streaming
> Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>


Re: Spark.ml roadmap 2.3.0 and beyond

2017-12-07 Thread Stephen Boesch
Thanks Joseph.  We can wait for post 2.3.0.

2017-12-07 15:36 GMT-08:00 Joseph Bradley <jos...@databricks.com>:

> Hi Stephen,
>
> I used to post those roadmap JIRAs to share instructions for contributing
> to MLlib and to try to coordinate amongst committers.  My feeling was that
> the coordination aspect was of mixed success, so I did not post one for
> 2.3.  I'm glad you pinged about this; if those were useful, then I can plan
> on posting one for the release after 2.3.  As far as identifying
> committers' plans, the best option right now is to look for Shepherds in
> JIRA as well as the few mailing list threads about directions.
>
> For myself, I'm mainly focusing on fixing some issues with persistence for
> custom algorithms in PySpark (done), adding the image schema (done), and
> using ML Pipelines in Structured Streaming (WIP).
>
> Joseph
>
> On Wed, Nov 29, 2017 at 6:52 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>> There are several  JIRA's and/or PR's that contain logic the Data Science
>> teams that I work with use in their local models. We are trying to
>> determine if/when these features may gain traction again.  In at least one
>> case all of the work were done but the shepherd said that getting it
>> committed were of lower priority than other tasks - one specifically
>> mentioned was the mllib/ml parity that has been ongoing for nearly three
>> years.
>>
>> In order to prioritize work that the ML platform would do it would be
>> helpful to know at least which if any of those tasks were going to be moved
>> ahead by the community: since we could then focus on other ones instead of
>> duplicating the effort.
>>
>> In addition there are some engineering code jam sessions that happen
>> periodically: knowing which features are actively on the roadmap would 
>> *certainly
>> *influence our selection of work.  The roadmaps from 2.2.0 and earlier
>> were a very good starting point to understand not just the specific work in
>> progress - but also the current mindset/thinking of the committers in terms
>> of general priorities.
>>
>> So if the same format of document were not available - then what content *is
>> *that gives a picture of where spark.ml were headed?
>>
>> 2017-11-29 6:39 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>
>>> Any further information/ thoughts?
>>>
>>>
>>>
>>> 2017-11-22 15:07 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>>>
>>>> The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available:
>>>>
>>>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813
>>>>
>>>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581
>>>> ..
>>>>
>>>> It seems those roadmaps were not available per se for 2.3.0 and later?
>>>> Is there a different mechanism for that info?
>>>>
>>>> stephenb
>>>>
>>>
>>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com]
>


Re: Spark.ml roadmap 2.3.0 and beyond

2017-11-29 Thread Stephen Boesch
There are several JIRAs and/or PRs that contain logic the Data Science
teams that I work with use in their local models. We are trying to
determine if/when these features may gain traction again. In at least one
case all of the work was done, but the shepherd said that getting it
committed was of lower priority than other tasks - one specifically
mentioned was the mllib/ml parity effort that has been ongoing for nearly
three years.

In order to prioritize the work that the ML platform team would do, it would
be helpful to know at least which, if any, of those tasks are going to be
moved ahead by the community, since we could then focus on other ones instead
of duplicating the effort.

In addition, there are some engineering code jam sessions that happen
periodically: knowing which features are actively on the roadmap would
*certainly* influence our selection of work. The roadmaps from 2.2.0 and
earlier were a very good starting point to understand not just the specific
work in progress - but also the current mindset/thinking of the committers
in terms of general priorities.

So if the same format of document is not available - then what content *is*
there that gives a picture of where spark.ml is headed?

2017-11-29 6:39 GMT-08:00 Stephen Boesch <java...@gmail.com>:

> Any further information/ thoughts?
>
>
>
> 2017-11-22 15:07 GMT-08:00 Stephen Boesch <java...@gmail.com>:
>
>> The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available:
>>
>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813
>>
>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581
>> ..
>>
>> It seems those roadmaps were not available per se for 2.3.0 and later?
>> Is there a different mechanism for that info?
>>
>> stephenb
>>
>
>


Spark.ml roadmap 2.3.0 and beyond

2017-11-22 Thread Stephen Boesch
The roadmaps for prior releases e.g. 1.6 2.0 2.1 2.2 were available:

2.2.0 https://issues.apache.org/jira/browse/SPARK-18813

2.1.0 https://issues.apache.org/jira/browse/SPARK-15581
..

It seems those roadmaps were not available per se for 2.3.0 and later? Is
there a different mechanism for that info?

stephenb


Re: Add a machine learning algorithm to sparkml

2017-10-20 Thread Stephen Boesch
A couple of less obvious facets of getting over the (significant!) hurdle
to have an algorithm accepted into mllib (/spark.ml):


   - the review time can be *very* long - a few to many months is a
   typical case even for relatively fast-tracked algorithms
   - you will likely be asked to provide evidence of a strong perceived
   need within the community/industry for the algorithm

These considerations may make it challenging for you to find a
yet-unimplemented algorithm that can be completed within a constrained
timeframe.



2017-10-20 19:43 GMT-07:00 Manolis Gemeliaris :

> Hello everyone,
>
> I am an undergraduate student now looking to do my final year project.
> Professor Minos Garofalakis suggested to me that, as a project, I could find
> a machine learning algorithm not implemented by anyone in Spark.ml and
> implement it.
> As the topic is related to contributing code (an algorithm implementation)
> to Spark, I am addressing you as well.
> My question to you is: are there any suggestions about what algorithm is
> missing from spark.ml currently that would be a good option to implement?
> (e.g. k-means and LDA are already there, and so is lsvm)
>
> Thanks in advance.
>


Re: MLlib mission and goals

2017-01-24 Thread Stephen Boesch
re: spark-packages.org and "Would these really be better in the core
project?" That was not at all the intent of my input: instead, to ask "how
and where do we structure/place deployment-quality code that is *not*
part of the distribution?" The spark-packages site has no curation whatsoever:
no minimum standards of code quality or deployment structure, let alone
qualitative measures of usefulness.

While spark-packages would never rival CRAN and friends, there is not even
any mechanism in place to get started. From the CRAN site:

   Even at the current growth rate of several packages a day, all
submissions are still rigorously quality-controlled using strong testing
features available in the R system.

   Maybe give something that has a subset of these processes a try?
Perhaps with different folks than the ones already over-subscribed in MLlib?

2017-01-24 2:37 GMT-08:00 Sean Owen :

> My $0.02, which shouldn't be weighted too much.
>
> I believe the mission as of Spark ML has been to provide the framework,
> and then implementation of 'the basics' only. It should have the tools that
> cover ~80% of use cases, out of the box, in a pretty well-supported and
> tested way.
>
> It's not a goal to support an arbitrarily large collection of algorithms
> because each one adds marginally less value, and IMHO, is proportionally
> bigger baggage, because the contributors tend to skew academic, produce
> worse code, and don't stick around to maintain it.
>
> The project is already generally quite overloaded; I don't know if there's
> bandwidth to even cover the current scope. While 'the basics' is a
> subjective label, de facto, I think we'd have to define it as essentially
> "what we already have in place" for the foreseeable future.
>
> That the bits on spark-packages.org aren't so hot is not a problem but a
> symptom. Would these really be better in the core project?
>
> And, or: I entirely agree with Joseph's take.
>
>
> On Tue, Jan 24, 2017 at 1:03 AM Joseph Bradley 
> wrote:
>
>> This thread is split off from the "Feedback on MLlib roadmap process
>> proposal" thread for discussing the high-level mission and goals for
>> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
>> lead to huge decisions.
>>
>> Copying from the previous thread:
>>
>> *Seth:*
>> """
>> I would love to hear some discussion on the higher level goal of Spark
>> MLlib (if this derails the original discussion, please let me know and we
>> can discuss in another thread). The roadmap does contain specific items
>> that help to convey some of this (ML parity with MLlib, model persistence,
>> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
>> often see PRs for brand new algorithms which are sometimes rejected and
>> sometimes not. Do we aim to keep implementing more and more algorithms? Or
>> is our focus really, now that we have a reasonable library of algorithms,
>> to simply make the existing ones faster/better/more robust? Should we aim
>> to make interfaces that are easily extended for developers to easily
>> implement their own custom code (e.g. custom optimization libraries), or do
>> we want to restrict things to out-of-the box algorithms? Should we focus on
>> more flexible, general abstractions like distributed linear algebra?
>>
>> I was not involved in the project in the early days of MLlib when this
>> discussion may have happened, but I think it would be useful to either
>> revisit it or restate it here for some of the newer developers.
>> """
>>
>> *Mingjie:*
>> """
>> +1 general abstractions like distributed linear algebra.
>> """
>>
>>
>> I'll add my thoughts, starting with our past *trajectory*:
>> * Initially, MLlib was mainly trying to build a set of core algorithms.
>> * Two years ago, the big effort was adding Pipelines.
>> * In the last year, big efforts have been around completing Pipelines and
>> making the library more robust.
>>
>> I agree with Seth that a few *immediate goals* are very clear:
>> * feature parity for DataFrame-based API
>> * completing and improving testing for model persistence
>> * Python, R parity
>>
>> *In the future*, it's harder to say, but if I had to pick my top 2
>> items, I'd list:
>>
>> *(1) Making MLlib more extensible*
>> It will not be feasible to support a huge number of algorithms, so
>> allowing users to customize their ML on Spark workflows will be critical.
>> This is IMO the most important thing we could do for MLlib.
>> Part of this could be building a healthy community of Spark Packages, and
>> we will need to make it easier for users to write their own algorithms and
>> packages to facilitate this.  Part of this could be allowing users to
>> customize existing algorithms with custom loss functions, etc.
>>
>> *(2) Consistent improvements to core algorithms*
>> A less exciting but still very important item will be constantly
>> improving the core set of algorithms in MLlib. This could mean speed,
>> 

Re: MLlib mission and goals

2017-01-23 Thread Stephen Boesch
Along the lines of #1: spark-packages seemed to have had a good start
about two years ago, but now no more than a handful are in general
use - e.g. the Databricks CSV package.
When the available packages are browsed, the majority are incomplete, empty,
unmaintained, or unclear.

Any ideas on how to resurrect spark-packages in a way that there will be
sufficient adoption for it to be meaningful?

2017-01-23 17:03 GMT-08:00 Joseph Bradley :

> This thread is split off from the "Feedback on MLlib roadmap process
> proposal" thread for discussing the high-level mission and goals for
> MLlib.  I hope this thread will collect feedback and ideas, not necessarily
> lead to huge decisions.
>
> Copying from the previous thread:
>
> *Seth:*
> """
> I would love to hear some discussion on the higher level goal of Spark
> MLlib (if this derails the original discussion, please let me know and we
> can discuss in another thread). The roadmap does contain specific items
> that help to convey some of this (ML parity with MLlib, model persistence,
> etc...), but I'm interested in what the "mission" of Spark MLlib is. We
> often see PRs for brand new algorithms which are sometimes rejected and
> sometimes not. Do we aim to keep implementing more and more algorithms? Or
> is our focus really, now that we have a reasonable library of algorithms,
> to simply make the existing ones faster/better/more robust? Should we aim
> to make interfaces that are easily extended for developers to easily
> implement their own custom code (e.g. custom optimization libraries), or do
> we want to restrict things to out-of-the box algorithms? Should we focus on
> more flexible, general abstractions like distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
> """
>
> *Mingjie:*
> """
> +1 general abstractions like distributed linear algebra.
> """
>
>
> I'll add my thoughts, starting with our past *trajectory*:
> * Initially, MLlib was mainly trying to build a set of core algorithms.
> * Two years ago, the big effort was adding Pipelines.
> * In the last year, big efforts have been around completing Pipelines and
> making the library more robust.
>
> I agree with Seth that a few *immediate goals* are very clear:
> * feature parity for DataFrame-based API
> * completing and improving testing for model persistence
> * Python, R parity
>
> *In the future*, it's harder to say, but if I had to pick my top 2 items,
> I'd list:
>
> *(1) Making MLlib more extensible*
> It will not be feasible to support a huge number of algorithms, so
> allowing users to customize their ML on Spark workflows will be critical.
> This is IMO the most important thing we could do for MLlib.
> Part of this could be building a healthy community of Spark Packages, and
> we will need to make it easier for users to write their own algorithms and
> packages to facilitate this.  Part of this could be allowing users to
> customize existing algorithms with custom loss functions, etc.
>
> *(2) Consistent improvements to core algorithms*
> A less exciting but still very important item will be constantly improving
> the core set of algorithms in MLlib. This could mean speed, scaling,
> robustness, and usability for the few algorithms which cover 90% of use
> cases.
>
> There are plenty of other possibilities, and it will be great to hear the
> community's thoughts!
>
> Thanks,
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>


Re: Organizing Spark ML example packages

2016-09-12 Thread Stephen Boesch
Yes: will you have cycles to do it?

2016-09-12 9:09 GMT-07:00 Nick Pentreath :

> Never actually got around to doing this - do folks still think it
> worthwhile?
>
> On Thu, 21 Apr 2016 at 00:10 Joseph Bradley  wrote:
>
>> Sounds good to me.  I'd request we be strict during this process about
>> requiring *no* changes to the example itself, which will make review easier.
>>
>> On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler  wrote:
>>
>>> +1, adding some organization would make it easier for people to find a
>>> specific example
>>>
>>> On Mon, Apr 18, 2016 at 11:52 PM, Yanbo Liang 
>>> wrote:
>>>
 This sounds good to me, and it will make ML examples more neatly.

 2016-04-14 5:28 GMT-07:00 Nick Pentreath :

> Hey Spark devs
>
> I noticed that we now have a large number of examples for ML & MLlib
> in the examples project - 57 for ML and 67 for MLLIB to be precise. This 
> is
> bound to get larger as we add features (though I know there are some PRs 
> to
> clean up duplicated examples).
>
> What do you think about organizing them into packages to match the use
> case and the structure of the code base? e.g.
>
> org.apache.spark.examples.ml.recommendation
>
> org.apache.spark.examples.ml.feature
>
> and so on...
>
> Is it worth doing? The doc pages with include_example would need
> updating, and the run_example script input would just need to change the
> package slightly. Did I miss any potential issue?
>
> N
>


>>>
>>


Re: [discuss] ending support for Java 7 in Spark 2.0

2016-03-24 Thread Stephen Boesch
+1 for Java 8 only. +1 for 2.11+ only. At this point, Scala libraries
supporting only 2.10 are typically less active and/or poorly maintained.
That trend will only continue when considering the lifespan of Spark 2.x.

2016-03-24 11:32 GMT-07:00 Steve Loughran :

>
> On 24 Mar 2016, at 15:27, Koert Kuipers  wrote:
>
> i think the arguments are convincing, but it also makes me wonder if i
> live in some kind of alternate universe... we deploy on customers clusters,
> where the OS, python version, java version and hadoop distro are not chosen
> by us. so think centos 6, cdh5 or hdp 2.3, java 7 and python 2.6. we simply
> have access to a single proxy machine and launch through yarn. asking them
> to upgrade java is pretty much out of the question or a 6+ month ordeal. of
> the 10 client clusters i can think of on the top of my head all of them are
> on java 7, none are on java 8. so by doing this you would make spark 2
> basically unusable for us (unless most of them have plans of upgrading in
> near term to java 8, i will ask around and report back...).
>
>
>
> It's not actually mandatory for the process executing in the Yarn cluster
> to run with the same JVM as the rest of the Hadoop stack; all that is
> needed is for the environment variables to set up the JAVA_HOME and PATH.
> Switching JVMs is not something which YARN makes easy to do, but it may be
> possible, especially if Spark itself provides some hooks, so you don't have
> to manually play with setting things up. That may be something which could
> significantly ease adoption of Spark 2 in YARN clusters. Same for Python.
>
> This is something I could probably help others to address
>
>


Re: spark task scheduling delay

2016-01-20 Thread Stephen Boesch
Which Resource Manager  are you using?

2016-01-20 21:38 GMT-08:00 Renu Yadav :

> Any suggestions?
>
> On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav  wrote:
>
>> Hi ,
>>
>> I am facing a Spark task scheduling delay issue in Spark 1.4.
>>
>> Suppose I have 1600 tasks running: 1550 tasks run fine, but for the
>> remaining 50 I am facing a task delay, even though the input size of these
>> tasks is the same as for the above 1550 tasks.
>>
>> Please suggest some solution.
>>
>> Thanks & Regards
>> Renu Yadav
>>
>
>


Re: what is the best way to debug spark / mllib?

2015-12-27 Thread Stephen Boesch
1) You should run the zinc incremental compiler.
2) If you want breakpoints, that should likely be done in local mode (see the
sketch below).
3) Adjust the log4j.properties settings and you can start to see the logInfo
output.
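For point 2, a minimal sketch of the kind of local-mode driver that makes breakpoints usable in an IDE (the KMeans call is just a placeholder for whatever you are changing in mllib/clustering):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object ClusteringDebug {
      def main(args: Array[String]): Unit = {
        // local[2] keeps driver and executors in a single JVM, so breakpoints
        // set inside mllib code are hit in-process.
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("mllib-debug"))
        val data = sc.parallelize(Seq(Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0)))
        val model = KMeans.train(data, 2, 5)   // k = 2, maxIterations = 5
        println(model.clusterCenters.mkString(", "))
        sc.stop()
      }
    }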

2015-12-27 0:20 GMT-08:00 salexln :

> Hi guys,
>
> I'm debugging my code in mllib/clustering but i'm not sure i'm doing it the
> best way:
> I build my changes in mllib using "build/mvn -DskipTests package" and then
> invoking my code using
> "./bin/spark-shell"
>
> My main issues:
> 1) After each change the build (build/mvn -DskipTests package) takes ~15
> mins
> 2) I cannot put breakpoints
> 3) If I add println of logInfo, I do not see it in the console.
>
> What us the best way to debug it?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/what-is-the-best-way-to-debug-spark-mllib-tp15809.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: SQL language vs DataFrame API

2015-12-09 Thread Stephen Boesch
Is this a candidate for the version 1.X/2.0 split?

2015-12-09 16:29 GMT-08:00 Michael Armbrust :

> Yeah, I would like to address any actual gaps in functionality that are
> present.
>
> On Wed, Dec 9, 2015 at 4:24 PM, Cristian Opris  > wrote:
>
>> The reason I'm asking is because it's important in larger projects to be
>> able to stick to a particular programming style. Some people are more
>> comfortable with SQL, others might find the DF api more suitable, but it's
>> important to have full expressivity in both to make it easier to adopt one
>> approach rather than have to mix and match to achieve full functionality.
>>
>> On 9 December 2015 at 19:41, Xiao Li  wrote:
>>
>>> That sounds great! When it is decided, please let us know and we can add
>>> more features and make it ANSI SQL compliant.
>>>
>>> Thank you!
>>>
>>> Xiao Li
>>>
>>>
>>> 2015-12-09 11:31 GMT-08:00 Michael Armbrust :
>>>
 I don't plan to abandon HiveQL compatibility, but I'd like to see us
 move towards something with more SQL compliance (perhaps just newer
 versions of the HiveQL parser).  Exactly which parser will do that for us
 is under investigation.

 On Wed, Dec 9, 2015 at 11:02 AM, Xiao Li  wrote:

> Hi, Michael,
>
> Does that mean SqlContext will be built on HiveQL in the near future?
>
> Thanks,
>
> Xiao Li
>
>
> 2015-12-09 10:36 GMT-08:00 Michael Armbrust :
>
>> I think that it is generally good to have parity when the
>> functionality is useful. However, in some cases various features are there
>> just to maintain compatibility with other systems. For example, CACHE TABLE
>> is eager because Shark's cache table was.  df.cache() is lazy because
>> Spark's cache is.  Does that mean that we need to add some eager caching
>> mechanism to dataframes to have parity?  Probably not, users can just 
>> call
>> .count() if they want to force materialization.
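To make the caching example concrete, a small sketch against the 1.x-era API under discussion:

    // SQL: eager - the table is scanned and materialized right here.
    sqlContext.sql("CACHE TABLE my_table")
    // (CACHE LAZY TABLE my_table defers it, mirroring the DataFrame behaviour.)

    // DataFrame: lazy - cache() only marks the plan...
    val df = sqlContext.table("my_table")
    df.cache()
    df.count()   // ...and an action such as count() is what actually populates the cache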
>>
>> Regarding the differences between HiveQL and the SQLParser, I think
>> we should get rid of the SQL parser.  Its kind of a hack that I built 
>> just
>> so that there was some SQL story for people who didn't compile with Hive.
>> Moving forward, I'd like to see the distinction between the HiveContext 
>> and
>> SQLContext removed and we can standardize on a single parser.  For this
>> reason I'd be opposed to spending a lot of dev/reviewer time on adding
>> features there.
>>
>> On Wed, Dec 9, 2015 at 8:34 AM, Cristian O <
>> cristian.b.op...@googlemail.com> wrote:
>>
>>> Hi,
>>>
>>> I was wondering what the "official" view is on feature parity
>>> between SQL and DF apis. Docs are pretty sparse on the SQL front, and it
>>> seems that some features are only supported at various times in only 
>>> one of
>>> Spark SQL dialect, HiveQL dialect and DF API. DF.cube(), DISTRIBUTE BY,
>>> CACHE LAZY are some examples
>>>
>>> Is there an explicit goal of having consistent support for all
>>> features in both DF and SQL ?
>>>
>>> Thanks,
>>> Cristian
>>>
>>
>>
>

>>>
>>
>


Re: Fastest way to build Spark from scratch

2015-12-08 Thread Stephen Boesch
I will echo Steve L's comment about having zinc running (with --nailed).
That provides at least a 2X speedup - sometimes without it spark simply
does not build for me.

2015-12-08 9:33 GMT-08:00 Josh Rosen :

> @Nick, on a fresh EC2 instance a significant chunk of the initial build
> time might be due to artifact resolution + downloading. Putting
> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
> decent chunk of time off that first build.
>
> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the tips, Jakob and Steve.
>>
>> It looks like my original approach is the best for me since I'm
>> installing Spark on newly launched EC2 instances and can't take advantage
>> of incremental compilation.
>>
>> Nick
>>
>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran 
>> wrote:
>>
>>> On 7 Dec 2015, at 19:07, Jakob Odersky  wrote:
>>>
>>> make-distribution and the second code snippet both create a distribution
>>> from a clean state. They therefore require that every source file be
>>> compiled and that takes time (you can maybe tweak some settings or use a
>>> newer compiler to gain some speed).
>>>
>>> I'm inferring from your question that for your use-case deployment speed
>>> is a critical issue, furthermore you'd like to build Spark for lots of
>>> (every?) commit in a systematic way. In that case I would suggest you try
>>> using the second code snippet without the `clean` task and only resort to
>>> it if the build fails.
>>>
>>> On my local machine, an assembly without a clean drops from 6 minutes to
>>> 2.
>>>
>>> regards,
>>> --Jakob
>>>
>>>
>>> 1. you can use zinc -where possible- to speed up scala compilations
>>> 2. you might also consider setting up a local jenkins VM, hooked to
>>> whatever git repo & branch you are working off, and have it do the builds
>>> and tests for you. Not so great for interactive dev,
>>>
>>> finally, on the mac, the "say" command is pretty handy at letting you
>>> know when some work in a terminal is ready, so you can do the
>>> first-thing-in-the morning build-of-the-SNAPSHOTS
>>>
>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>>
>>> After that you can work on the modules you care about (via the -pl)
>>> option). That doesn't work if you are running on an EC2 instance though
>>>
>>>
>>>
>>>
>>> On 23 November 2015 at 20:18, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Say I want to build a complete Spark distribution against Hadoop 2.6+
 as fast as possible from scratch.

 This is what I’m doing at the moment:

 ./make-distribution.sh -T 1C -Phadoop-2.6

 -T 1C instructs Maven to spin up 1 thread per available core. This
 takes around 20 minutes on an m3.large instance.

 I see that spark-ec2, on the other hand, builds Spark as follows
 
 when you deploy Spark at a specific git commit:

 sbt/sbt clean assembly
 sbt/sbt publish-local

 This seems slower than using make-distribution.sh, actually.

 Is there a faster way to do this?

 Nick
 ​

>>>
>>>
>>>
>


Re: A proposal for Spark 2.0

2015-11-12 Thread Stephen Boesch
My understanding is that RDDs presently have more support for
complete control of partitioning, which is a key consideration at scale.
While partitioning control is still piecemeal in DF/DS, it would seem
premature to make RDDs a second-tier approach to Spark development.

An example is the use of groupBy when we know that the source relation
(/RDD) is already partitioned on the grouping expressions. AFAIK Spark
SQL still does not allow that knowledge to be applied by the optimizer, so
a full shuffle will be performed. However, in the native RDD API we can use
preservesPartitioning=true.
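A concrete sketch of the contrast (assuming an existing SparkContext `sc`):

    import org.apache.spark.HashPartitioner

    // RDD API: once the data is hash-partitioned and that fact is retained,
    // key-wise aggregations reuse the partitioner and avoid another shuffle.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val prePartitioned = pairs.partitionBy(new HashPartitioner(8)).cache()
    val counts = prePartitioned.reduceByKey(_ + _)    // no additional shuffle
    val scaled = prePartitioned.mapValues(_ * 2)      // mapValues keeps the partitioner
      .reduceByKey(_ + _)                             // still no shuffle

    // The DataFrame equivalent (df.groupBy("key").agg(...)) currently has no way
    // to be told the source is already partitioned on the grouping key, so it
    // plans a full exchange.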

2015-11-12 17:42 GMT-08:00 Mark Hamstra :

> The place of the RDD API in 2.0 is also something I've been wondering
> about.  I think it may be going too far to deprecate it, but changing
> emphasis is something that we might consider.  The RDD API came well before
> DataFrames and DataSets, so programming guides, introductory how-to
> articles and the like have, to this point, also tended to emphasize RDDs --
> or at least to deal with them early.  What I'm thinking is that with 2.0
> maybe we should overhaul all the documentation to de-emphasize and
> reposition RDDs.  In this scheme, DataFrames and DataSets would be
> introduced and fully addressed before RDDs.  They would be presented as the
> normal/default/standard way to do things in Spark.  RDDs, in contrast,
> would be presented later as a kind of lower-level, closer-to-the-metal API
> that can be used in atypical, more specialized contexts where DataFrames or
> DataSets don't fully fit.
>
> On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao  wrote:
>
>> I am not sure what the best practice for this specific problem, but it’s
>> really worth to think about it in 2.0, as it is a painful issue for lots of
>> users.
>>
>>
>>
>> By the way, is it also an opportunity to deprecate the RDD API (or
>> internal API only?)? As lots of its functionality overlapping with
>> DataFrame or DataSet.
>>
>>
>>
>> Hao
>>
>>
>>
>> *From:* Kostas Sakellis [mailto:kos...@cloudera.com]
>> *Sent:* Friday, November 13, 2015 5:27 AM
>> *To:* Nicholas Chammas
>> *Cc:* Ulanov, Alexander; Nan Zhu; wi...@qq.com; dev@spark.apache.org;
>> Reynold Xin
>>
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> I know we want to keep breaking changes to a minimum but I'm hoping that
>> with Spark 2.0 we can also look at better classpath isolation with user
>> programs. I propose we build on spark.{driver|executor}.userClassPathFirst,
>> setting it true by default, and not allow any spark transitive dependencies
>> to leak into user code. For backwards compatibility we can have a whitelist
>> if we want but I'd be good if we start requiring user apps to explicitly
>> pull in all their dependencies. From what I can tell, Hadoop 3 is also
>> moving in this direction.
>>
>>
>>
>> Kostas
>>
>>
>>
>> On Thu, Nov 12, 2015 at 9:56 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>> On that note of deprecating stuff, it might be good to deprecate some
>> things in 2.0 without removing or replacing them immediately. That way 2.0
>> doesn’t have to wait for everything that we want to deprecate to be
>> replaced all at once.
>>
>> Nick
>>
>> ​
>>
>>
>>
>> On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander <
>> alexander.ula...@hpe.com> wrote:
>>
>> Parameter Server is a new feature and thus does not match the goal of 2.0
>> is “to fix things that are broken in the current API and remove certain
>> deprecated APIs”. At the same time I would be happy to have that feature.
>>
>>
>>
>> With regards to Machine learning, it would be great to move useful
>> features from MLlib to ML and deprecate the former. Current structure of
>> two separate machine learning packages seems to be somewhat confusing.
>>
>> With regards to GraphX, it would be great to deprecate the use of RDD in
>> GraphX and switch to Dataframe. This will allow GraphX evolve with Tungsten.
>>
>>
>>
>> Best regards, Alexander
>>
>>
>>
>> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
>> *Sent:* Thursday, November 12, 2015 7:28 AM
>> *To:* wi...@qq.com
>> *Cc:* dev@spark.apache.org
>> *Subject:* Re: A proposal for Spark 2.0
>>
>>
>>
>> Being specific to Parameter Server, I think the current agreement is that
>> PS shall exist as a third-party library instead of a component of the core
>> code base, isn’t?
>>
>>
>>
>> Best,
>>
>>
>>
>> --
>>
>> Nan Zhu
>>
>> http://codingcat.me
>>
>>
>>
>> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>>
>> Who has the idea of machine learning? Spark missing some features 

Re: State of the Build

2015-11-05 Thread Stephen Boesch
Yes. The current dev/change-scala-version.sh mutates (/pollutes) the build
environment by updating the pom.xml in each of the subprojects. If you were
able to come up with a structure that avoids that approach it would be an
improvement.
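For reference, the built-in sbt mechanism Jakob describes below looks roughly like this - an illustrative build.sbt fragment, not Spark's actual build:

    scalaVersion := "2.10.4"
    crossScalaVersions := Seq("2.10.4", "2.11.7")

    // %% resolves the right _2.10/_2.11 artifact automatically.
    libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"

    // `sbt +package` then builds artifacts for every listed Scala version, and
    // `sbt ++2.11.7 compile` targets one version for a single session -
    // no pom rewriting and no git-visible changes.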

2015-11-05 15:38 GMT-08:00 Jakob Odersky :

> Hi everyone,
> in the process of learning Spark, I wanted to get an overview of the
> interaction between all of its sub-projects. I therefore decided to have a
> look at the build setup and its dependency management.
> Since I am alot more comfortable using sbt than maven, I decided to try to
> port the maven configuration to sbt (with the help of automated tools).
> This led me to a couple of observations and questions on the build system
> design:
>
> First, currently, there are two build systems, maven and sbt. Is there a
> preferred tool (or future direction to one)?
>
> Second, the sbt build also uses maven "profiles" requiring the use of
> specific commandline parameters when starting sbt. Furthermore, since it
> relies on maven poms, dependencies to the scala binary version (_2.xx) are
> hardcoded and require running an external script when switching versions.
> Sbt could leverage built-in constructs to support cross-compilation and
> emulate profiles with configurations and new build targets. This would
> remove external state from the build (in that no extra steps need to be
> performed in a particular order to generate artifacts for a new
> configuration) and therefore improve stability and build reproducibility
> (maybe even build performance). I was wondering if implementing such
> functionality for the sbt build would be welcome?
>
> thanks,
> --Jakob
>


Re: Using scala-2.11 when making changes to spark source

2015-09-28 Thread Stephen Boesch
The effects of changing the pom.xml files extend beyond cases in which we wish
to modify Spark itself. In addition, when pulling from trunk we need to
either stash or roll back the changes before rebasing.

Could an effort to look into a better solution (possibly including evaluating
Ted Yu's suggested approach) be considered?

2015-09-20 9:12 GMT-07:00 Ted Yu <yuzhih...@gmail.com>:

> Maybe the following can be used for changing Scala version:
> http://maven.apache.org/archetype/maven-archetype-plugin/
>
> I played with it a little bit but didn't get far.
>
> FYI
>
> On Sun, Sep 20, 2015 at 6:18 AM, Stephen Boesch <java...@gmail.com> wrote:
>
>>
>> The dev/change-scala-version.sh [2.11]  script modifies in-place  the
>> pom.xml files across all of the modules.  This is a git-visible change.  So
>> if we wish to make changes to spark source in our own forks - while
>> developing with scala 2.11 - we would end up conflating those updates with
>> our own.
>>
>> A possible scenario would be to update .gitignore - by adding pom.xml.
>> However I can not get that to work: .gitignore is tricky.
>>
>> Suggestions appreciated.
>>
>
>


Using scala-2.11 when making changes to spark source

2015-09-20 Thread Stephen Boesch
The dev/change-scala-version.sh [2.11]  script modifies in-place  the
pom.xml files across all of the modules.  This is a git-visible change.  So
if we wish to make changes to spark source in our own fork's - while
developing with scala 2.11 - we would end up conflating those updates with
our own.

A possible scenario would be to update .gitignore - by adding pom.xml.
However I can not get that to work: .gitignore is tricky.

Suggestions appreciated.


Re: Enum parameter in ML

2015-09-16 Thread Stephen Boesch
There was a long thread about enums, initiated by Xiangrui several months
back, in which the final consensus was to use Java enums. Is that
discussion (/decision) applicable here?

2015-09-16 17:43 GMT-07:00 Ulanov, Alexander :

> Hi Joseph,
>
>
>
> Strings sounds reasonable. However, there is no StringParam (only
> StringArrayParam). Should I create a new param type? Also, how can the user
> get all possible values of String parameter?
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Joseph Bradley [mailto:jos...@databricks.com]
> *Sent:* Wednesday, September 16, 2015 5:35 PM
> *To:* Feynman Liang
> *Cc:* Ulanov, Alexander; dev@spark.apache.org
> *Subject:* Re: Enum parameter in ML
>
>
>
> I've tended to use Strings.  Params can be created with a validator
> (isValid) which can ensure users get an immediate error if they try to pass
> an unsupported String.  Not as nice as compile-time errors, but easier on
> the APIs.
>
>
>
> On Mon, Sep 14, 2015 at 6:07 PM, Feynman Liang 
> wrote:
>
> We usually write a Java test suite which exercises the public API (e.g.
> DCT).
>
>
>
> It may be possible to create a sealed trait with singleton concrete
> instances inside of a serializable companion object, then just introduce a
> Param[SealedTrait] to the model (e.g. the StreamingDecay PR).
> However, this would require Java users to use
> CompanionObject$.ConcreteInstanceName to access enum values, which isn't the
> prettiest syntax.
>
>
>
> Another option would just be to use Strings, which although is not type
> safe does simplify implementation.
>
>
>
> On Mon, Sep 14, 2015 at 5:43 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Hi Feynman,
>
>
>
> Thank you for suggestion. How can I ensure that there will be no problems
> for Java users? (I only use Scala API)
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Monday, September 14, 2015 5:27 PM
> *To:* Ulanov, Alexander
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Enum parameter in ML
>
>
>
> Since PipelineStages are serializable, the params must also be
> serializable. We also have to keep the Java API in mind. Introducing a new
> enum Param type may work, but we will have to ensure that Java users can
> use it without dealing with ClassTags (I believe Scala will create new
> types for each possible value in the Enum) and that it can be serialized.
>
>
>
> On Mon, Sep 14, 2015 at 4:31 PM, Ulanov, Alexander <
> alexander.ula...@hpe.com> wrote:
>
> Dear Spark developers,
>
>
>
> I am currently implementing the Estimator in ML that has a parameter that
> can take several different values that are mutually exclusive. The most
> appropriate type seems to be Scala Enum (
> http://www.scala-lang.org/api/current/index.html#scala.Enumeration).
> However, the current ML API has the following parameter types:
>
> BooleanParam, DoubleArrayParam, DoubleParam, FloatParam, IntArrayParam,
> IntParam, LongParam, StringArrayParam
>
>
>
> Should I introduce a new parameter type in ML API that is based on Scala
> Enum?
>
>
>
> Best regards, Alexander
>
>
>
>
>
>
>
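
A minimal sketch of the String-plus-isValid pattern Joseph describes,
assuming the org.apache.spark.ml.param API of that era; the param name and
allowed values below are purely illustrative:

  import org.apache.spark.ml.param.{Param, ParamValidators, Params}

  trait HasSolver extends Params {
    // "solver" and its allowed values are made up for illustration; isValid
    // rejects any other String as soon as the user calls set().
    val solver: Param[String] = new Param[String](this, "solver",
      "solver algorithm, one of: l-bfgs, gd",
      ParamValidators.inArray(Array("l-bfgs", "gd")))

    def getSolver: String = $(solver)
  }

Listing the allowed values in the doc string, as above, is also one common
answer to the discoverability question Alexander raises.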


Re: The latest master branch didn't compile with -Phive?

2015-07-09 Thread Stephen Boesch
Please do a *clean* package and reply back if you still encounter issues.
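i.e. something along the lines of the command quoted below, with clean added:

  build/mvn clean -Phive -Phadoop-2.4 -DskipTests package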

2015-07-09 7:24 GMT-07:00 Yijie Shen henry.yijies...@gmail.com:

 Hi,

 I use a clean version just cloned from the master branch, built with:

 build/mvn -Phive -Phadoop-2.4 -DskipTests package

 And BUILD FAILURE at last, due to:

 [error]  while compiling: 
 /Users/yijie/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
 [error] during phase: typer
 [error]  library version: version 2.10.4
 [error] compiler version: version 2.10.4
 ...
 [error]
 [error]   last tree to typer: Ident(Warehouse)
 [error]   symbol: none (flags: )
 [error]symbol definition: none
 [error]symbol owners:
 [error]   context owners: lazy value hiveWarehouse - class 
 HiveMetastoreCatalog - package hive
 [error]
 [error] == Enclosing template or block ==
 [error]
 [error] Template( // val local HiveMetastoreCatalog: notype in class 
 HiveMetastoreCatalog
 [error]   Catalog, Logging // parents
 [error]   ValDef(
 [error] private
 [error] _
 [error] tpt
 [error] empty
 [error]   )
 [error]   // 24 statements
 [error]   ValDef( // private[this] val client: 
 org.apache.spark.sql.hive.client.ClientInterface in class HiveMetastoreCatalog
 [error] private local paramaccessor
 [error] client
 [error] ClientInterface
 [error] empty
 …


 https://gist.github.com/yijieshen/e0925e2227a312ae4c64#file-build_failure

 Did I make a silly mistake?

 Thanks, Yijie



Re: enum-like types in Spark

2015-07-01 Thread Stephen Boesch
I am reviving an old thread here. The link to the example code for the
java enum based solution is now dead: would someone please post an updated
link showing the proper interop?

Specifically: it is my understanding that java enums may not be created
within Scala.  So does the proposed solution require dropping out into Java
to create the enums?
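
(As an aside: a Java enum does indeed have to be declared in a .java source
file - Scala 2 cannot define one - but consuming it from Scala is
straightforward. A trivial sketch using an existing Spark enum, SaveMode:

  import org.apache.spark.sql.SaveMode  // a Java enum shipped with Spark SQL

  val mode: SaveMode = SaveMode.Overwrite
  mode match {                          // enum constants are stable identifiers, so matching works
    case SaveMode.Overwrite => println("will overwrite existing data")
    case _                  => println("some other mode")
  }

The wildcard case is needed because the Scala compiler does not check
exhaustiveness for Java enums.)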

2015-04-09 17:16 GMT-07:00 Xiangrui Meng men...@gmail.com:

 Using Java enums sounds good. We can list the values in the JavaDoc and
 hope Scala will be able to correctly generate docs for Java enums in
 the future. -Xiangrui

 On Thu, Apr 9, 2015 at 10:59 AM, Imran Rashid iras...@cloudera.com
 wrote:
  any update here?  This is relevant for a currently open PR of mine --
 I've
  got a bunch of new public constants defined w/ format #4, but I'd gladly
  switch to java enums.  (Even if we are just going to postpone this
 decision,
  I'm still inclined to switch to java enums ...)
 
  just to be clear about the existing problem with enums & scaladoc: right
  now, the scaladoc knows about the enum class, and generates a page for
 it,
  but it does not display the enum constants.  It is at least labeled as a
  java enum, though, so a savvy user could switch to the javadocs to see
 the
  constants.
 
 
 
  On Mon, Mar 23, 2015 at 4:50 PM, Imran Rashid iras...@cloudera.com
 wrote:
 
  well, perhaps I overstated things a little, I wouldn't call it the
  official solution, just a recommendation in the never-ending debate
 (and
  the recommendation from folks with their hands on scala itself).
 
  Even if we do get this fixed in scaladoc eventually -- as its not in the
  current versions, where does that leave this proposal?  personally I'd
  *still* prefer java enums, even if it doesn't get into scaladoc.  btw,
 even
  with sealed traits, the scaladoc still isn't great -- you don't see the
  values from the class, you only see them listed from the companion
 object.
  (though, that is somewhat standard for scaladoc, so maybe I'm reaching a
  little)
 
 
 
  On Mon, Mar 23, 2015 at 4:11 PM, Patrick Wendell pwend...@gmail.com
  wrote:
 
  If the official solution from the Scala community is to use Java
  enums, then it seems strange they aren't generated in scaladoc? Maybe
  we can just fix that w/ Typesafe's help and then we can use them.
 
  On Mon, Mar 23, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote:
   Yeah the fully realized #4, which gets back the ability to use it in
   switch statements (? in Scala but not Java?) does end up being kind
 of
   huge.
  
   I confess I'm swayed a bit back to Java enums, seeing what it
   involves. The hashCode() issue can be 'solved' with the hash of the
   String representation.
  
   On Mon, Mar 23, 2015 at 8:33 PM, Imran Rashid iras...@cloudera.com
   wrote:
   I've just switched some of my code over to the new format, and I
 just
   want
   to make sure everyone realizes what we are getting into.  I went
 from
   10
   lines as java enums
  
  
  
 https://github.com/squito/spark/blob/fef66058612ebf225e58dd5f5fea6bae1afd5b31/core/src/main/java/org/apache/spark/status/api/StageStatus.java#L20
  
   to 30 lines with the new format:
  
  
  
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/status/api/v1/api.scala#L250
  
   its not just that its verbose.  each name has to be repeated 4
 times,
   with
   potential typos in some locations that won't be caught by the
   compiler.
   Also, you have to manually maintain the values as you update the
 set
   of
   enums, the compiler won't do it for you.
  
   The only downside I've heard for java enums is enum.hashcode().
 OTOH,
   the
   downsides for this version are: maintainability / verbosity, no
   values(),
   more cumbersome to use from java, no enum map / enumset.
  
   I did put together a little util to at least get back the equivalent
   of
   enum.valueOf() with this format
  
  
  
 https://github.com/squito/spark/blob/SPARK-3454_w_jersey/core/src/main/scala/org/apache/spark/util/SparkEnum.scala
  
   I'm not trying to prevent us from moving forward on this, its fine
 if
   this
   is still what everyone wants, but I feel pretty strongly java enums
   make
   more sense.
  
   thanks,
   Imran
  
   -
   To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
   For additional commands, e-mail: dev-h...@spark.apache.org
  
 
 
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: How to link code pull request with JIRA ID?

2015-05-13 Thread Stephen Boesch
Following up from Nicholas, it is:

[SPARK-12345] Your PR description

where 12345 is the jira number.


One thing I tend to forget is when/where to include the subproject tag e.g.
 [MLLIB]
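
(For instance, an illustrative title would be "[SPARK-12345][MLLIB] Your PR
description" - i.e. the subproject tag goes right after the JIRA key.)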


2015-05-13 11:11 GMT-07:00 Nicholas Chammas nicholas.cham...@gmail.com:

 That happens automatically when you open a PR with the JIRA key in the PR
 title.

 On Wed, May 13, 2015 at 2:10 PM Chandrashekhar Kotekar 
 shekhar.kote...@gmail.com wrote:

  Hi,
 
  I am new to open source contribution and trying to understand the process
  starting from pulling code to uploading patch.
 
  I have managed to pull code from GitHub. In JIRA I saw that each JIRA
 issue
  is connected with pull request. I would like to know how do people attach
  pull request details to JIRA issue?
 
  Thanks,
  Chandrash3khar Kotekar
  Mobile - +91 8600011455
 



Re: Pickling error when attempting to add a method in pyspark

2015-04-30 Thread Stephen Boesch
Bumping this.  Is anyone here familiar with the py4j interface in
pyspark?

thanks


2015-04-27 22:09 GMT-07:00 Stephen Boesch java...@gmail.com:


 My intention is to add pyspark support for certain mllib spark methods.  I
 have been unable to resolve pickling errors of the form

Pyspark py4j PickleException: “expected zero arguments for
 construction of ClassDict”
 http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

 These are occurring during python to java conversion of python named
 tuples.  The details are rather hard to provide here so I have created an
 SOF question


 http://stackoverflow.com/questions/29910708/pyspark-py4j-pickleexception-expected-zero-arguments-for-construction-of-class

 In any case I have included the text here. The SOF is easier to read
 though ;)

 --

 This question is directed towards persons familiar with py4j - and can
 help to resolve a pickling error. I am trying to add a method to the
 pyspark PythonMLLibAPI that accepts an RDD of a namedtuple, does some work,
 and returns a result in the form of an RDD.

 This method is modeled after the PYthonMLLibAPI.trainALSModel() method,
 whose analogous *existing* relevant portions are:

   def trainALSModel(
 ratingsJRDD: JavaRDD[Rating],
 .. )

 The *existing* python Rating class used to model the new code is:

 class Rating(namedtuple("Rating", ["user", "product", "rating"])):
 def __reduce__(self):
 return Rating, (int(self.user), int(self.product), float(self.rating))

 Here is the attempt So here are the relevant classes:

 *New* python class pyspark.mllib.clustering.MatrixEntry:

 from collections import namedtuple
 class MatrixEntry(namedtuple("MatrixEntry", ["x", "y", "weight"])):
 def __reduce__(self):
 return MatrixEntry, (long(self.x), long(self.y), float(self.weight))

 *New* method *foobarRDD* In PythonMLLibAPI:

   def foobarRdd(
 data: JavaRDD[MatrixEntry]): RDD[FooBarResult] = {
 val rdd = data.rdd.map { d => FooBarResult(d.i, d.j, d.value, d.i * 100 + 
 d.j * 10 + d.value)}
 rdd
   }

 Now let us try it out:

 from pyspark.mllib.clustering import MatrixEntry
 def convert_to_MatrixEntry(tuple):
   return MatrixEntry(*tuple)
 from pyspark.mllib.clustering import *
 pic = PowerIterationClusteringModel(2)
 tups = [(1,2,3),(4,5,6),(12,13,14),(15,7,8),(16,17,16.5)]
 trdd = sc.parallelize(map(convert_to_MatrixEntry,tups))
 # print out the RDD on python side just for validation
 print "%s" % (repr(trdd.collect()))
 from pyspark.mllib.common import callMLlibFunc
 pic = callMLlibFunc("foobar", trdd)

 Relevant portions of results:

 [(1,2)=3.0, (4,5)=6.0, (12,13)=14.0, (15,7)=8.0, (16,17)=16.5]

 which shows the input rdd is 'whole'. However the pickling was unhappy:

 5/04/27 21:15:44 ERROR Executor: Exception in task 6.0 in stage 1.0 (TID 14)
 net.razorvine.pickle.PickleException: expected zero arguments for 
 construction of ClassDict(for pyspark.mllib.clustering.MatrixEntry)
 at 
 net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
 at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
 at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
 at net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
 at net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)
 at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1167)
 at 
 org.apache.spark.mllib.api.python.SerDe$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(PythonMLLibAPI.scala:1166)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
 at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:819)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1523

Re: IntelliJ Runtime error

2015-04-04 Thread Stephen Boesch
Thanks Cheng. Yes, the problem is that the way to set things up to run inside
Intellij changes very frequently.  It is unfortunately not simply a one-time
investment to get IJ debugging working properly: the steps required are a
moving target, shifting roughly every month or two.

Doing remote debugging is probably a good choice to reduce the dev
environment volatility/maintenance.
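
(For step 4 in Cheng's list below, the Java options in question are the
standard JDWP agent string, for example

  --driver-java-options "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

with suspend=y if the program should wait for the debugger to attach; the
exact string IDEA suggests in the Remote run-configuration dialog is the
authoritative one.)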



2015-04-04 5:46 GMT-07:00 Cheng Lian lian.cs@gmail.com:

 I found in general it's a pain to build/run Spark inside IntelliJ IDEA. I
 guess most people resort to this approach so that they can leverage the
 integrated debugger to debug and/or learn Spark internals. A more
 convenient way I'm using recently is resorting to the remote debugging
 feature. In this way, by adding driver/executor Java options, you may build
 and start the Spark applications/tests/daemons in the normal way and attach
 the debugger to it. I was using this to debug the HiveThriftServer2, and it
 worked perfectly.

 Steps to enable remote debugging:

 1. Menu Run / Edit configurations...
 2. Click the + button, choose Remote
 3. Choose Attach or Listen in Debugger mode according to your actual
 needs
 4. Copy, edit, and add Java options suggested in the dialog to
 `--driver-java-options` or `--executor-java-options`
 5. If you're using attaching mode, first start your Spark program, then
 start remote debugging in IDEA
 6. If you're using listening mode, first start remote debugging in IDEA,
 and then start your Spark program.

 Hope this can be helpful.

 Cheng


 On 4/4/15 12:54 AM, sara mustafa wrote:

 Thank you, it works for me when I changed the dependencies from provided
 to compile.



 --
 View this message in context: http://apache-spark-
 developers-list.1001551.n3.nabble.com/IntelliJ-Runtime-
 error-tp11383p11385.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: enum-like types in Spark

2015-03-04 Thread Stephen Boesch
#4, but with MemoryOnly (more Scala-like)

http://docs.scala-lang.org/style/naming-conventions.html

Constants, Values, Variable and Methods

Constant names should be in upper camel case. That is, if the member is
final, immutable and it belongs to a package object or an object, it may be
considered a constant (similar to Java’s static final members):


   1. object Container {
   2. val MyConstant = ...
   3. }
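
For concreteness, a compact sketch of what option 4 can look like, modeled
loosely on StorageLevel; the type and value names here are illustrative only:

  // Enum-like type as a plain Scala class plus companion object: the
  // constants are vals on the object, so they appear in both ScalaDoc and
  // JavaDoc.
  class SchedulingStrategy private (val name: String) {
    override def toString: String = name
  }

  object SchedulingStrategy {
    val Fifo = new SchedulingStrategy("Fifo")
    val Fair = new SchedulingStrategy("Fair")

    def fromString(name: String): SchedulingStrategy = name match {
      case "Fifo" => Fifo
      case "Fair" => Fair
      case _      => throw new IllegalArgumentException(s"Unknown strategy: $name")
    }
  }

Java callers would see the values through the generated forwarders as
SchedulingStrategy.Fifo(), i.e. with the trailing parentheses mentioned below.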


2015-03-04 17:11 GMT-08:00 Xiangrui Meng men...@gmail.com:

 Hi all,

 There are many places where we use enum-like types in Spark, but in
 different ways. Every approach has both pros and cons. I wonder
 whether there should be an “official” approach for enum-like types in
 Spark.

 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc)

 * All types show up as Enumeration.Value in Java.

 http://spark.apache.org/docs/latest/api/java/org/apache/spark/scheduler/SchedulingMode.html

 2. Java’s Enum (e.g., SaveMode, IOMode)

 * Implementation must be in a Java file.
 * Values doesn’t show up in the ScalaDoc:

 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.network.util.IOMode

 3. Static fields in Java (e.g., TripletFields)

 * Implementation must be in a Java file.
 * Doesn’t need “()” in Java code.
 * Values don't show up in the ScalaDoc:

 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.graphx.TripletFields

 4. Objects in Scala. (e.g., StorageLevel)

 * Needs “()” in Java code.
 * Values show up in both ScalaDoc and JavaDoc:

 http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.storage.StorageLevel$

 http://spark.apache.org/docs/latest/api/java/org/apache/spark/storage/StorageLevel.html

 It would be great if we have an “official” approach for this as well
 as the naming convention for enum-like values (“MEMORY_ONLY” or
 “MemoryOnly”). Personally, I like 4) with “MEMORY_ONLY”. Any thoughts?

 Best,
 Xiangrui

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: Broken record a bit here: building spark on intellij with sbt

2015-02-05 Thread Stephen Boesch
Hi Akhil,
  The instructions you provided show how to manually set up an sbt
project that adds spark as a dependency, whereas my original post was
about how to open the existing spark sbt project itself.  These are not
the same task.

2015-02-04 21:46 GMT-08:00 Akhil Das ak...@sigmoidanalytics.com:

 Here's the sbt version
 https://docs.sigmoidanalytics.com/index.php/Step_by_Step_instructions_on_how_to_build_Spark_App_with_IntelliJ_IDEA


 Thanks
 Best Regards

 On Thu, Feb 5, 2015 at 8:55 AM, Stephen Boesch java...@gmail.com wrote:

 For building in intellij with sbt my mileage has varied widely: it had
 built as late as Monday (after the 1.3.0 release)  - and with zero
 'special' steps: just import as sbt project.

  However I can not presently repeat the process.  The wiki page has the
 latest instructions on how to build with maven - but not with sbt.  Is
 there a resource for that?


 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IntelliJ

 The error I see is same as from a post in July


 http://apache-spark-user-list.1001560.n3.nabble.com/Build-spark-with-Intellij-IDEA-13-td9904.html

 Here is an excerpt:

 uncaught exception during compilation: java.lang.AssertionError
 Error:scalac: Error: assertion failed:
 com.google.protobuf.InvalidProtocolBufferException
 java.lang.AssertionError: assertion failed:
 com.google.protobuf.InvalidProtocolBufferException
 at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1212)

 The answer in the mailing list to that thread was about using maven .. so
 that is not useful here.





Broken record a bit here: building spark on intellij with sbt

2015-02-04 Thread Stephen Boesch
For building in intellij with sbt my mileage has varied widely: it had
built as late as Monday (after the 1.3.0 release)  - and with zero
'special' steps: just import as sbt project.

 However I cannot presently repeat the process.  The wiki page has the
latest instructions on how to build with maven - but not with sbt.  Is
there a resource for that?

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-IntelliJ

The error I see is same as from a post in July

http://apache-spark-user-list.1001560.n3.nabble.com/Build-spark-with-Intellij-IDEA-13-td9904.html

Here is an excerpt:

uncaught exception during compilation: java.lang.AssertionError
Error:scalac: Error: assertion failed:
com.google.protobuf.InvalidProtocolBufferException
java.lang.AssertionError: assertion failed:
com.google.protobuf.InvalidProtocolBufferException
at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1212)

The answer in the mailing list to that thread was about using maven .. so
that is not useful here.


Re: Building Spark with Pants

2015-02-02 Thread Stephen Boesch
There is a significant investment in sbt and maven - and they are not at
all likely to go away. Do we really want a third build tool?  Note that
there is also the question of building within an IDE - which presently
works for sbt, and with a little bit of tweaking for maven as well.

2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com:

 Does anyone here have experience with Pants
 http://pantsbuild.github.io/index.html or interest in trying to build
 Spark with it?

 Pants has an interesting story. It was born at Twitter to help them build
 their Scala, Java, and Python projects as several independent components in
 one monolithic repo. (It was inspired by a similar build tool at Google
 called blaze.) The mix of languages and sub-projects at Twitter seems
 similar to the breakdown we have in Spark.

 Pants has an interesting take on how a build system should work, and
 Twitter and Foursquare (who use Pants as their primary build tool) claim it
 helps enforce better build hygiene and maintainability.

 Some relevant talks:

- Building Scala Hygienically with Pants
  https://www.youtube.com/watch?v=ukqke8iTuH0
- The Pants Build Tool at Twitter
  https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter
- Getting Started with the Pants Build System: Why Pants?
  https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants

 At some point I may take a shot at converting Spark to use Pants as an
 experiment and just see what it’s like.

 Nick
 ​



Adding third party jars to classpath used by pyspark

2014-12-29 Thread Stephen Boesch
What is the recommended way to do this?  We have some native database
client libraries for which we are adding pyspark bindings.

pyspark invokes spark-submit.   Do we add our libraries to
SPARK_SUBMIT_LIBRARY_PATH ?

This issue relates back to an error we have been seeing - "Py4jError: Trying
to call a package" - the suspicion being that the third party libraries may
not be available on the jvm side.
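
(If the JVM-side pieces are ordinary jars, one commonly used route is to hand
them to pyspark/spark-submit directly rather than editing
SPARK_SUBMIT_LIBRARY_PATH - a sketch with placeholder paths:

  bin/pyspark --jars /path/to/dbclient-wrapper.jar \
    --driver-class-path /path/to/dbclient-wrapper.jar

Truly native .so/.dylib libraries are a different matter and would typically
go through spark.driver.extraLibraryPath / spark.executor.extraLibraryPath
instead.)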


Re: Required file not found in building

2014-12-02 Thread Stephen Boesch
Thanks Sean, I followed suit (brew install zinc) and that is working.

2014-12-01 22:39 GMT-08:00 Sean Owen so...@cloudera.com:

 I'm having no problems with the build or zinc on my Mac. I use zinc
 from brew install zinc.

 On Tue, Dec 2, 2014 at 3:02 AM, Stephen Boesch java...@gmail.com wrote:
  Mac as well.  Just found the problem:  I had created an alias to zinc a
  couple of months back. Apparently that is not happy with the build
 anymore.
  No problem now that the issue has been isolated - just need to fix my
 zinc
  alias.
 
  2014-12-01 18:55 GMT-08:00 Ted Yu yuzhih...@gmail.com:
 
  I tried the same command on MacBook and didn't experience the same
 error.
 
  Which OS are you using ?
 
  Cheers
 
  On Mon, Dec 1, 2014 at 6:42 PM, Stephen Boesch java...@gmail.com
 wrote:
 
  It seems there were some additional settings required to build spark
 now .
  This should be a snap for most of you ot there about what I am missing.
  Here is the command line I have traditionally used:
 
 mvn -Pyarn -Phadoop-2.3 -Phive install compile package -DskipTests
 
  That command line is however failing with the lastest from HEAD:
 
  INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
  spark-network-common_2.10 ---
  [INFO] Using zinc server for incremental compilation
  [INFO] compiler plugin:
  BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
 
  *[error] Required file not found: scala-compiler-2.10.4.jar*
 
  *[error] See zinc -help for information about locating necessary files*
 
  [INFO]
 
 
  [INFO] Reactor Summary:
  [INFO]
  [INFO] Spark Project Parent POM .. SUCCESS
  [4.077s]
  [INFO] Spark Project Networking .. FAILURE
  [0.445s]
 
 
  OK let's try zinc -help:
 
  18:38:00/spark2 $*zinc -help*
  Nailgun server running with 1 cached compiler
 
  Version = 0.3.5.1
 
  Zinc compiler cache limit = 5
  Resident scalac cache limit = 0
  Analysis cache limit = 5
 
  Compiler(Scala 2.10.4) [74ff364f]
  Setup = {
  *   scala compiler =
 
 
 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar*
 scala library =
 
 
 /Users/steve/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
 scala extra = {
 
 
 
 /Users/steve/.m2/repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
/shared/zinc-0.3.5.1/lib/scala-reflect.jar
 }
 sbt interface = /shared/zinc-0.3.5.1/lib/sbt-interface.jar
 compiler interface sources =
  /shared/zinc-0.3.5.1/lib/compiler-interface-sources.jar
 java home =
 fork java = false
 cache directory = /Users/steve/.zinc/0.3.5.1
  }
 
  Does that compiler jar exist?  Yes!
 
  18:39:34/spark2 $ll
 
 
 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
  -rw-r--r--  1 steve  staff  14445780 Apr  9  2014
 
 
 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
 
 
 



Required file not found in building

2014-12-01 Thread Stephen Boesch
It seems there are some additional settings required to build spark now.
This should be a snap for most of you out there to spot what I am missing.
Here is the command line I have traditionally used:

   mvn -Pyarn -Phadoop-2.3 -Phive install compile package -DskipTests

That command line is, however, failing with the latest from HEAD:

INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
spark-network-common_2.10 ---
[INFO] Using zinc server for incremental compilation
[INFO] compiler plugin:
BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

*[error] Required file not found: scala-compiler-2.10.4.jar*

*[error] See zinc -help for information about locating necessary files*

[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [4.077s]
[INFO] Spark Project Networking .. FAILURE [0.445s]


OK let's try zinc -help:

18:38:00/spark2 $*zinc -help*
Nailgun server running with 1 cached compiler

Version = 0.3.5.1

Zinc compiler cache limit = 5
Resident scalac cache limit = 0
Analysis cache limit = 5

Compiler(Scala 2.10.4) [74ff364f]
Setup = {
*   scala compiler =
/Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar*
   scala library =
/Users/steve/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
   scala extra = {

/Users/steve/.m2/repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
  /shared/zinc-0.3.5.1/lib/scala-reflect.jar
   }
   sbt interface = /shared/zinc-0.3.5.1/lib/sbt-interface.jar
   compiler interface sources =
/shared/zinc-0.3.5.1/lib/compiler-interface-sources.jar
   java home =
   fork java = false
   cache directory = /Users/steve/.zinc/0.3.5.1
}

Does that compiler jar exist?  Yes!

18:39:34/spark2 $ll
/Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
-rw-r--r--  1 steve  staff  14445780 Apr  9  2014
/Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar


Re: Required file not found in building

2014-12-01 Thread Stephen Boesch
Mac as well.  Just found the problem:  I had created an alias to zinc a
couple of months back. Apparently that is not happy with the build anymore.
No problem now that the issue has been isolated - just need to fix my zinc
alias.

2014-12-01 18:55 GMT-08:00 Ted Yu yuzhih...@gmail.com:

 I tried the same command on MacBook and didn't experience the same error.

 Which OS are you using ?

 Cheers

 On Mon, Dec 1, 2014 at 6:42 PM, Stephen Boesch java...@gmail.com wrote:

 It seems there were some additional settings required to build spark now .
 This should be a snap for most of you ot there about what I am missing.
 Here is the command line I have traditionally used:

mvn -Pyarn -Phadoop-2.3 -Phive install compile package -DskipTests

 That command line is however failing with the lastest from HEAD:

 INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-network-common_2.10 ---
 [INFO] Using zinc server for incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

 *[error] Required file not found: scala-compiler-2.10.4.jar*

 *[error] See zinc -help for information about locating necessary files*

 [INFO]
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. SUCCESS
 [4.077s]
 [INFO] Spark Project Networking .. FAILURE
 [0.445s]


 OK let's try zinc -help:

 18:38:00/spark2 $*zinc -help*
 Nailgun server running with 1 cached compiler

 Version = 0.3.5.1

 Zinc compiler cache limit = 5
 Resident scalac cache limit = 0
 Analysis cache limit = 5

 Compiler(Scala 2.10.4) [74ff364f]
 Setup = {
 *   scala compiler =

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar*
scala library =

 /Users/steve/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
scala extra = {


 /Users/steve/.m2/repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
   /shared/zinc-0.3.5.1/lib/scala-reflect.jar
}
sbt interface = /shared/zinc-0.3.5.1/lib/sbt-interface.jar
compiler interface sources =
 /shared/zinc-0.3.5.1/lib/compiler-interface-sources.jar
java home =
fork java = false
cache directory = /Users/steve/.zinc/0.3.5.1
 }

 Does that compiler jar exist?  Yes!

 18:39:34/spark2 $ll

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
 -rw-r--r--  1 steve  staff  14445780 Apr  9  2014

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar





Re: Required file not found in building

2014-12-01 Thread Stephen Boesch
The zinc src zip for  0.3.5.3 was  downloaded  and exploded. Then I  ran
sbt dist/create .  zinc is being launched from
dist/target/zinc-0.3.5.3/bin/zinc

2014-12-01 20:12 GMT-08:00 Ted Yu yuzhih...@gmail.com:

 I use zinc 0.2.0 and started zinc with the same command shown below.

 I don't observe such error.

 How did you install zinc-0.3.5.3 ?

 Cheers

 On Mon, Dec 1, 2014 at 8:00 PM, Stephen Boesch java...@gmail.com wrote:


 Can anyone assist with how to run zinc against the latest maven build?

 I am starting zinc as follows:

 /shared/zinc-0.3.5.3/dist/target/zinc-0.3.5.3/bin/zinc -scala-home
 $SCALA_HOME -nailed -start

 The pertinent env vars are:


 19:58:11/lib $echo $SCALA_HOME
 /shared/scala
 19:58:14/lib $which scala
 /shared/scala/bin/scala
 19:58:16/lib $scala -version
 Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL


 When I do *not* start zinc the maven build works - but very slowly,
 since no incremental compiler is available.

 When zinc is started as shown above then the error occurs on all of the
 modules except parent:


 [INFO] Using zinc server for incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
 [error] Required file not found: scala-compiler-2.10.4.jar
 [error] See zinc -help for information about locating necessary files

 2014-12-01 19:02 GMT-08:00 Stephen Boesch java...@gmail.com:

 Mac as well.  Just found the problem:  I had created an alias to zinc a
 couple of months back. Apparently that is not happy with the build anymore.
 No problem now that the issue has been isolated - just need to fix my zinc
 alias.

 2014-12-01 18:55 GMT-08:00 Ted Yu yuzhih...@gmail.com:

 I tried the same command on MacBook and didn't experience the same error.

 Which OS are you using ?

 Cheers

 On Mon, Dec 1, 2014 at 6:42 PM, Stephen Boesch java...@gmail.com
 wrote:

 It seems there were some additional settings required to build spark
 now .
 This should be a snap for most of you ot there about what I am missing.
 Here is the command line I have traditionally used:

mvn -Pyarn -Phadoop-2.3 -Phive install compile package -DskipTests

 That command line is however failing with the lastest from HEAD:

 INFO] --- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
 spark-network-common_2.10 ---
 [INFO] Using zinc server for incremental compilation
 [INFO] compiler plugin:
 BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)

 *[error] Required file not found: scala-compiler-2.10.4.jar*

 *[error] See zinc -help for information about locating necessary files*

 [INFO]

 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM .. SUCCESS
 [4.077s]
 [INFO] Spark Project Networking .. FAILURE
 [0.445s]


 OK let's try zinc -help:

 18:38:00/spark2 $*zinc -help*
 Nailgun server running with 1 cached compiler

 Version = 0.3.5.1

 Zinc compiler cache limit = 5
 Resident scalac cache limit = 0
 Analysis cache limit = 5

 Compiler(Scala 2.10.4) [74ff364f]
 Setup = {
 *   scala compiler =

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar*
scala library =

 /Users/steve/.m2/repository/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
scala extra = {


 /Users/steve/.m2/repository/org/scala-lang/scala-reflect/2.10.4/scala-reflect-2.10.4.jar
   /shared/zinc-0.3.5.1/lib/scala-reflect.jar
}
sbt interface = /shared/zinc-0.3.5.1/lib/sbt-interface.jar
compiler interface sources =
 /shared/zinc-0.3.5.1/lib/compiler-interface-sources.jar
java home =
fork java = false
cache directory = /Users/steve/.zinc/0.3.5.1
 }

 Does that compiler jar exist?  Yes!

 18:39:34/spark2 $ll

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar
 -rw-r--r--  1 steve  staff  14445780 Apr  9  2014

 /Users/steve/.m2/repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar








Re: mvn or sbt for studying and developing Spark?

2014-11-16 Thread Stephen Boesch
Hi Michael,
 That insight is useful.  Some thoughts:

* I moved from sbt to maven in June specifically because Andrew Or described
mvn as the default build tool.  Developers should keep in mind that jenkins
uses mvn, so we need to run mvn before submitting PRs - even if sbt is used
for day-to-day dev work
* In addition, as Sean has alluded to, Intellij seems to comprehend the
maven builds a bit more readily than the sbt one
* But for command-line and day-to-day dev purposes, sbt sounds great to use.
Those sound bites you provided about exposing built-in test databases for
hive and for displaying available testcases are sweet.  Is there an
easy/convenient way to see more of those kinds of facilities available
through sbt? (The ones mentioned in this thread are collected just below.)
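
As an unofficial recap, the sbt facilities mentioned in this thread boil
down to the following (behavior as of the sbt version in use at the time):

  sbt/sbt                              # start a long-running console session
  > ~compile                           # recompile continuously as files change
  > sql/test-only *CachedTableSuite    # run a single suite in one subproject (names tab-complete)
  > hive/console                       # REPL with Hive test tables pre-loaded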


2014-11-16 13:23 GMT-08:00 Michael Armbrust mich...@databricks.com:

 I'm going to have to disagree here.  If you are building a release
 distribution or integrating with legacy systems then maven is probably the
 correct choice.  However most of the core developers that I know use sbt,
 and I think its a better choice for exploration and development overall.
 That said, this probably falls into the category of a religious argument so
 you might want to look at both options and decide for yourself.

 In my experience the SBT build is significantly faster with less effort
 (and I think sbt is still faster even if you go through the extra effort of
 installing zinc) and easier to read.  The console mode of sbt (just run
 sbt/sbt and then a long running console session is started that will accept
 further commands) is great for building individual subprojects or running
 single test suites.  In addition to being faster since its a long running
 JVM, its got a lot of nice features like tab-completion for test case
 names.

 For example, if I wanted to see what test cases are available in the SQL
 subproject you can do the following:

 [marmbrus@michaels-mbp spark (tpcds)]$ sbt/sbt
 [info] Loading project definition from
 /Users/marmbrus/workspace/spark/project/project
 [info] Loading project definition from

 /Users/marmbrus/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
 [info] Set current project to spark-parent (in build
 file:/Users/marmbrus/workspace/spark/)
  sql/test-only *tab*
 --
  org.apache.spark.sql.CachedTableSuite
 org.apache.spark.sql.DataTypeSuite
  org.apache.spark.sql.DslQuerySuite
 org.apache.spark.sql.InsertIntoSuite
 ...

 Another very useful feature is the development console, which starts an
 interactive REPL including the most recent version of the code and a lot of
 useful imports for some subprojects.  For example in the hive subproject it
 automatically sets up a temporary database with a bunch of test data
 pre-loaded:

 $ sbt/sbt hive/console
  hive/console
 ...
 import org.apache.spark.sql.hive._
 import org.apache.spark.sql.hive.test.TestHive._
 import org.apache.spark.sql.parquet.ParquetTestData
 Welcome to Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
 1.7.0_45).
 Type in expressions to have them evaluated.
 Type :help for more information.

 scala sql(SELECT * FROM src).take(2)
 res0: Array[org.apache.spark.sql.Row] = Array([238,val_238], [86,val_86])

 Michael

 On Sun, Nov 16, 2014 at 3:27 AM, Dinesh J. Weerakkody 
 dineshjweerakk...@gmail.com wrote:

  Hi Stephen and Sean,
 
  Thanks for correction.
 
  On Sun, Nov 16, 2014 at 12:28 PM, Sean Owen so...@cloudera.com wrote:
 
   No, the Maven build is the main one.  I would use it unless you have a
   need to use the SBT build in particular.
   On Nov 16, 2014 2:58 AM, Dinesh J. Weerakkody 
   dineshjweerakk...@gmail.com wrote:
  
   Hi Yiming,
  
   I believe that both SBT and MVN is supported in SPARK, but SBT is
   preferred
   (I'm not 100% sure about this :) ). When I'm using MVN I got some
 build
   failures. After that used SBT and works fine.
  
   You can go through these discussions regarding SBT vs MVN and learn
 pros
   and cons of both [1] [2].
  
   [1]
  
  
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Necessity-of-Maven-and-SBT-Build-in-Spark-td2315.html
  
   [2]
  
  
 
 https://groups.google.com/forum/#!msg/spark-developers/OxL268v0-Qs/fBeBY8zmh3oJ
  
   Thanks,
  
   On Sun, Nov 16, 2014 at 7:11 AM, Yiming (John) Zhang 
 sdi...@gmail.com
   wrote:
  
Hi,
   
   
   
I am new in developing Spark and my current focus is about
   co-scheduling of
spark tasks. However, I am confused with the building tools:
 sometimes
   the
documentation uses mvn but sometimes uses sbt.
   
   
   
So, my question is that which one is the preferred tool of Spark
   community?
And what's the technical difference between them? Thank you!
   
   
   
Cheers,
   
Yiming
   
   
  
  
   --
    Thanks & Best Regards,
  
   *Dinesh J. Weerakkody*
  
  
 
 
  --
   Thanks & Best Regards,
 
  *Dinesh J. Weerakkody*
 



Re: sbt scala compiler crashes on spark-sql

2014-11-02 Thread Stephen Boesch
Yes, I have seen this same error - and for team members as well - repeatedly
since June. As Patrick and Cheng mentioned, the next step is to do an sbt
clean.

2014-11-02 19:37 GMT-08:00 Cheng Lian lian.cs@gmail.com:

 I often see this when I first build the whole Spark project with SBT, then
 modify some code and tries to build and debug within IDEA, or vice versa. A
 clean rebuild can always solve this.

 On Mon, Nov 3, 2014 at 11:28 AM, Patrick Wendell pwend...@gmail.com
 wrote:

  Does this happen if you clean and recompile? I've seen failures on and
  off, but haven't been able to find one that I could reproduce from a
  clean build such that we could hand it to the scala team.
 
  - Patrick
 
  On Sun, Nov 2, 2014 at 7:25 PM, Imran Rashid im...@therashids.com
 wrote:
   I'm finding the scala compiler crashes when I compile the spark-sql
  project
   in sbt.  This happens in both the 1.1 branch and master (full error
   below).  The other projects build fine in sbt, and everything builds
 fine
   in maven.  is there some sbt option I'm forgetting?  Any one else
   experiencing this?
  
   Also, are there up-to-date instructions on how to do common dev tasks
 in
    both sbt & maven?  I have only found these instructions on building
 with
   maven:
  
   http://spark.apache.org/docs/latest/building-with-maven.html
  
   and some general info here:
  
  
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  
   but I think this doesn't walk through a lot of the steps of a typical
 dev
   cycle, eg, continuous compilation, running one test, running one main
   class, etc.  (especially since it seems like people still favor sbt for
   dev.)  If it doesn't already exist somewhere, I could try to put
  together a
   brief doc for how to do the basics.  (I'm returning to spark dev after
 a
   little hiatus myself, and I'm hitting some stumbling blocks that are
   probably common knowledge to everyone still dealing with it all the
  time.)
  
   thanks,
   Imran
  
   --
   full crash info from sbt:
  
   project sql
   [info] Set current project to spark-sql (in build
   file:/Users/imran/spark/spark/)
   compile
   [info] Compiling 62 Scala sources to
   /Users/imran/spark/spark/sql/catalyst/target/scala-2.10/classes...
   [info] Compiling 45 Scala sources and 39 Java sources to
   /Users/imran/spark/spark/sql/core/target/scala-2.10/classes...
   [error]
   [error]  while compiling:
  
 
 /Users/imran/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/types/util/DataTypeConversions.scala
   [error] during phase: jvm
   [error]  library version: version 2.10.4
   [error] compiler version: version 2.10.4
   [error]   reconstructed args: -classpath
  
 
 

Re: Spark consulting

2014-10-31 Thread Stephen Boesch
May we please refrain from using the spark mailing list for job inquiries?
Thanks.

2014-10-31 13:35 GMT-07:00 Alessandro Baretta alexbare...@gmail.com:

 Hello,

 Is anyone open to do some consulting work on Spark in San Mateo?

 Thanks.

 Alex



Re: best IDE for scala + spark development?

2014-10-30 Thread Stephen Boesch
Hi Nabeel,
  In what ways is the IJ version of the scala repl enhanced?  Thanks!

2014-10-30 3:41 GMT-07:00 nm3...@gmail.com:

 IntelliJ idea scala plugin comes with an enhanced REPL. It's a pretty
 decent option too.

 Nabeel

  On Oct 28, 2014, at 5:34 AM, Cheng Lian lian.cs@gmail.com wrote:
 
  My two cents for Mac Vim/Emacs users. Fixed a Scala ctags Mac
 compatibility
  bug months ago, and you may want to use the most recent version here
 
 https://github.com/scala/scala-dist/blob/master/tool-support/src/emacs/contrib/dot-ctags
 
 
 
  On Tue, Oct 28, 2014 at 4:26 PM, Duy Huynh duy.huynh@gmail.com
 wrote:
 
  thanks everyone.  i've been using vim and sbt recently, and i really
 like
  it.  it's lightweight, fast.  plus, ack, ctrl-t, nerdtree, etc. in vim do
  all the good work.
 
  but, as i'm not familiar with scala/spark api yet, i really wish to have
  these two things in vim + sbt.
 
  1.  code completion as in intellij (typing long method / class name in
  scala/spark isn't that fun!)
 
  2.  scala doc on the fly in the text editor (just so i don't have to
 switch
  back and forth between the text editor and the scala doc)
 
  did anyone have experience with adding these 2 things to vim?
 
  thanks!
 
 
 
 
 
 
  On Mon, Oct 27, 2014 at 5:14 PM, Will Benton wi...@redhat.com wrote:
 
  I'll chime in as yet another user who is extremely happy with sbt and a
  text editor.  (In my experience, running ack from the command line is
  usually just as easy and fast as using an IDE's find-in-project
  facility.)
  You can, of course, extend editors with Scala-specific IDE-like
  functionality (in particular, I am aware of -- but have not used --
  ENSIME
  for emacs or TextMate).
 
  Since you're new to Scala, you may not know that you can run any sbt
  command preceded by a tilde, which will watch files in your project and
  run
  the command when anything changes.  Therefore, running ~compile from
  the
  sbt repl will get you most of the continuous syntax-checking
  functionality
  you can get from an IDE.
 
  best,
  wb
 
  - Original Message -
  From: ll duy.huynh@gmail.com
  To: d...@spark.incubator.apache.org
  Sent: Sunday, October 26, 2014 10:07:20 AM
  Subject: best IDE for scala + spark development?
 
  i'm new to both scala and spark.  what IDE / dev environment do you
  find
  most
  productive for writing code in scala with spark?  is it just vim +
 sbt?
  or
  does a full IDE like intellij works out better?  thanks!
 
 
 
  --
  View this message in context:
 
 http://apache-spark-developers-list.1001551.n3.nabble.com/best-IDE-for-scala-spark-development-tp8965.html
  Sent from the Apache Spark Developers List mailing list archive at
  Nabble.com.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




HiveShim not found when building in Intellij

2014-10-28 Thread Stephen Boesch
I have run on the command line via maven and it is fine:

mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn -Phadoop-2.3
 compile package install


But with the latest code Intellij builds do not work. Following is one of
26 similar errors:


Error:(173, 38) not found: value HiveShim
  Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))
 ^


Re: HiveShim not found when building in Intellij

2014-10-28 Thread Stephen Boesch
Hi Matei,
  Until my latest pull from upstream/master it had not been necessary to
add the hive profile: is it now?

I am not using sbt gen-idea. The way to open in intellij has been to Open
the parent directory; IJ recognizes it as a maven project.

There are several steps to do surgery on the yarn-parent / yarn projects,
then do a full rebuild.  That was working until one week ago.
Intellij/maven is presently broken in two ways: (1) this hive shim (which may
yet hopefully be a small/simple fix - let us see) and (2) the
NoClassDefFoundError on ThreadFactoryBuilder from my prior emails, which is
quite a serious problem.

2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:

 Hi Stephen,

 How did you generate your Maven workspace? You need to make sure the Hive
 profile is enabled for it. For example sbt/sbt -Phive gen-idea.

 Matei

  On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com wrote:
 
  I have run on the command line via maven and it is fine:
 
  mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn -Phadoop-2.3
  compile package install
 
 
  But with the latest code Intellij builds do not work. Following is one of
  26 similar errors:
 
 
  Error:(173, 38) not found: value HiveShim
 
 Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))
  ^




Re: HiveShim not found when building in Intellij

2014-10-28 Thread Stephen Boesch
Thanks Patrick for the heads up.

I have not been able to discover a combination of profiles (i.e.
enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
maven. If anyone knows how to handle this, a quick note here would be
appreciated.



2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:

 Hey Stephen,

 In some cases in the maven build we now have pluggable source
 directories based on profiles using the maven build helper plug-in.
 This is necessary to support cross building against different Hive
 versions, and there will be additional instances of this due to
 supporting scala 2.11 and 2.10.

 In these cases, you may need to add source locations explicitly to
 intellij if you want the entire project to compile there.

 Unfortunately as long as we support cross-building like this, it will
 be an issue. Intellij's maven support does not correctly detect our
 use of the maven-build-plugin to add source directories.

 We should come up with a good set of instructions on how to import the
 pom files + add the few extra source directories. Off hand I am not
 sure exactly what the correct sequence is.

 - Patrick

 On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com wrote:
  Hi Matei,
Until my latest pull from upstream/master it had not been necessary to
  add the hive profile: is it now??
 
  I am not using sbt gen-idea. The way to open in intellij has been to Open
  the parent directory. IJ recognizes it as a maven project.
 
  There are several steps to do surgery on the yarn-parent / yarn projects
 ,
  then do a full rebuild.  That was working until one week ago.
  Intellij/maven is presently broken in  two ways:  this hive shim (which
 may
  yet hopefully be a small/simple fix - let us see) and  (2) the
  NoClassDefFoundError
  on ThreadFactoryBuilder from my prior emails -and which is quite a
 serious
   problem .
 
  2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:
 
  Hi Stephen,
 
  How did you generate your Maven workspace? You need to make sure the
 Hive
  profile is enabled for it. For example sbt/sbt -Phive gen-idea.
 
  Matei
 
   On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com
 wrote:
  
   I have run on the command line via maven and it is fine:
  
   mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn
 -Phadoop-2.3
   compile package install
  
  
   But with the latest code Intellij builds do not work. Following is
 one of
   26 similar errors:
  
  
   Error:(173, 38) not found: value HiveShim
  
  Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))
   ^
 
 



Re: HiveShim not found when building in Intellij

2014-10-28 Thread Stephen Boesch
I am interested specifically in how to build (and hopefully run/debug)
under Intellij.  Your posts sound like command-line maven - which has
always worked already.

Do you have instructions for building in IJ?

2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com:

 Yes, these two combinations work for me.


 On 10/29/14 12:32 PM, Zhan Zhang wrote:

 -Phive is to enable hive-0.13.1 and -Phive -Phive-0.12.0” is to enable
 hive-0.12.0. Note that the thrift-server is not supported yet in hive-0.13,
 but expected to go to upstream soon (Spark-3720).

 Thanks.

 Zhan Zhang


   On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com wrote:

  Thanks Patrick for the heads up.

 I have not been successful to discover a combination of profiles (i.e.
 enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
 maven. Anyone who knows how to handle this - a quick note here would be
 appreciated.



 2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:

  Hey Stephen,

 In some cases in the maven build we now have pluggable source
 directories based on profiles using the maven build helper plug-in.
 This is necessary to support cross building against different Hive
 versions, and there will be additional instances of this due to
 supporting scala 2.11 and 2.10.

 In these cases, you may need to add source locations explicitly to
 intellij if you want the entire project to compile there.

 Unfortunately as long as we support cross-building like this, it will
 be an issue. Intellij's maven support does not correctly detect our
 use of the maven-build-plugin to add source directories.

 We should come up with a good set of instructions on how to import the
 pom files + add the few extra source directories. Off hand I am not
 sure exactly what the correct sequence is.

 - Patrick

 On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com
 wrote:

 Hi Matei,
   Until my latest pull from upstream/master it had not been necessary
 to
 add the hive profile: is it now??

 I am not using sbt gen-idea. The way to open in intellij has been to
 Open
 the parent directory. IJ recognizes it as a maven project.

 There are several steps to do surgery on the yarn-parent / yarn
 projects

 ,

 then do a full rebuild.  That was working until one week ago.
 Intellij/maven is presently broken in  two ways:  this hive shim (which

 may

 yet hopefully be a small/simple fix - let us see) and  (2) the
 NoClassDefFoundError
 on ThreadFactoryBuilder from my prior emails -and which is quite a

 serious

 problem .

 2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:

  Hi Stephen,

 How did you generate your Maven workspace? You need to make sure the

 Hive

 profile is enabled for it. For example sbt/sbt -Phive gen-idea.

 Matei

  On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com

 wrote:

 I have run on the command line via maven and it is fine:

 mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn

 -Phadoop-2.3

 compile package install


 But with the latest code Intellij builds do not work. Following is

 one of

 26 similar errors:


 Error:(173, 38) not found: value HiveShim

  Option(tableParameters.get(HiveShim.getStatsSetupConstTotalSize))

 ^







Re: HiveShim not found when building in Intellij

2014-10-28 Thread Stephen Boesch
I have selected the same options as Cheng Lian: hadoop-2.4, hive, hive
0.12.0.  After a full Rebuild in IJ I still see the HiveShim errors.

I really do not know what is different. I had pulled three hours ago from
github upstream master.

Just for kicks I am now trying Patrick's combination, which uses 0.13.1.
But it appears there is something else going on here.

Patrick / Cheng: did you build on the command line using Maven first?  I do
that since in the past it had been required.

2014-10-28 21:57 GMT-07:00 Patrick Wendell pwend...@gmail.com:

 I just started a totally fresh IntelliJ project importing from our
 root pom. I used all the default options and I added hadoop-2.4,
 hive, hive-0.13.1 profiles. I was able to run spark core tests from
 within IntelliJ. Didn't try anything beyond that, but FWIW this
 worked.

 - Patrick

 On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian lian.cs@gmail.com wrote:
  You may first open the root pom.xml file in IDEA, and then go for menu
 View
  / Tool Windows / Maven Projects, then choose desired Maven profile
  combination under the Profiles node (e.g. I usually use hadoop-2.4 +
 hive
  + hive-0.12.0). IDEA will ask you to re-import the Maven projects,
 confirm,
  then it should be OK.
 
  I can debug within IDEA with this approach. However, you have to clean
 the
  whole project before debugging Spark within IDEA if you compiled the
 project
  outside IDEA. Haven't got time to investigate this annoying issue.
 
  Also, you can remove sub projects unrelated to your tasks to accelerate
  compilation and/or avoid other IDEA build issues (e.g. Avro related Spark
  streaming build failure in IDEA).
 
 
  On 10/29/14 12:42 PM, Stephen Boesch wrote:
 
  I am interested specifically in how to build (and hopefully run/debug..)
  under Intellij.  Your posts sound like command line maven - which has
 always
  been working already.
 
  Do you have instructions for building in IJ?
 
  2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com:
 
  Yes, these two combinations work for me.
 
 
  On 10/29/14 12:32 PM, Zhan Zhang wrote:
 
  -Phive is to enable hive-0.13.1 and -Phive -Phive-0.12.0 is to enable
  hive-0.12.0. Note that the thrift-server is not supported yet in
 hive-0.13,
  but expected to go to upstream soon (Spark-3720).
 
  Thanks.
 
  Zhan Zhang
 
 
On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com
 wrote:
 
  Thanks Patrick for the heads up.
 
  I have not been successful to discover a combination of profiles (i.e.
  enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij with
  maven. Anyone who knows how to handle this - a quick note here would
 be
  appreciated.
 
 
 
  2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:
 
  Hey Stephen,
 
  In some cases in the maven build we now have pluggable source
  directories based on profiles using the maven build helper plug-in.
  This is necessary to support cross building against different Hive
  versions, and there will be additional instances of this due to
  supporting scala 2.11 and 2.10.
 
  In these cases, you may need to add source locations explicitly to
  intellij if you want the entire project to compile there.
 
  Unfortunately as long as we support cross-building like this, it will
  be an issue. Intellij's maven support does not correctly detect our
  use of the maven-build-plugin to add source directories.
 
  We should come up with a good set of instructions on how to import
 the
  pom files + add the few extra source directories. Off hand I am not
  sure exactly what the correct sequence is.
 
  - Patrick
 
  On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com
  wrote:
 
  Hi Matei,
Until my latest pull from upstream/master it had not been
 necessary
  to
  add the hive profile: is it now??
 
  I am not using sbt gen-idea. The way to open in intellij has been to
  Open
  the parent directory. IJ recognizes it as a maven project.
 
  There are several steps to do surgery on the yarn-parent / yarn
  projects
 
  ,
 
  then do a full rebuild.  That was working until one week ago.
  Intellij/maven is presently broken in  two ways:  this hive shim
  (which
 
  may
 
  yet hopefully be a small/simple fix - let us see) and  (2) the
  NoClassDefFoundError
  on ThreadFactoryBuilder from my prior emails -and which is quite a
 
  serious
 
  problem .
 
  2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:
 
  Hi Stephen,
 
  How did you generate your Maven workspace? You need to make sure
 the
 
  Hive
 
  profile is enabled for it. For example sbt/sbt -Phive gen-idea.
 
  Matei
 
  On Oct 28, 2014, at 7:42 PM, Stephen Boesch java...@gmail.com
 
  wrote:
 
  I have run on the command line via maven and it is fine:
 
  mvn   -Dscalastyle.failOnViolation=false -DskipTests -Pyarn
 
  -Phadoop-2.3
 
  compile package install
 
 
  But with the latest code Intellij builds do not work. Following is
 
  one of
 
  26 similar errors:
 
 
  Error

Re: HiveShim not found when building in Intellij

2014-10-28 Thread Stephen Boesch
Thanks guys - adding the source root for the shim manually was the issue.

For some reason the other issue I was struggling with
(NoClassDefFoundError on ThreadFactoryBuilder) also disappeared. I am able
to run tests now inside IJ.  Woot

2014-10-28 22:13 GMT-07:00 Patrick Wendell pwend...@gmail.com:

 Oops - I actually should have added v0.13.0 (i.e. to match whatever I
 did in the profile).

 On Tue, Oct 28, 2014 at 10:05 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Cheng - to make it recognize the new HiveShim for 0.12 I had to click
  on spark-hive under packages in the left pane, then go to Open
  Module Settings - then explicitly add the v0.12.0/src/main/scala
  folder to the sources by navigating to it and then ctrl+click to add
  it as a source. Did you have to do this?
 
  On Tue, Oct 28, 2014 at 9:57 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  I just started a totally fresh IntelliJ project importing from our
  root pom. I used all the default options and I added hadoop-2.4,
  hive, hive-0.13.1 profiles. I was able to run spark core tests from
  within IntelliJ. Didn't try anything beyond that, but FWIW this
  worked.
 
  - Patrick
 
  On Tue, Oct 28, 2014 at 9:54 PM, Cheng Lian lian.cs@gmail.com
 wrote:
  You may first open the root pom.xml file in IDEA, and then go for menu
 View
  / Tool Windows / Maven Projects, then choose desired Maven profile
  combination under the Profiles node (e.g. I usually use hadoop-2.4 +
 hive
  + hive-0.12.0). IDEA will ask you to re-import the Maven projects,
 confirm,
  then it should be OK.
 
  I can debug within IDEA with this approach. However, you have to clean
 the
  whole project before debugging Spark within IDEA if you compiled the
 project
  outside IDEA. Haven't got time to investigate this annoying issue.
 
  Also, you can remove sub projects unrelated to your tasks to accelerate
  compilation and/or avoid other IDEA build issues (e.g. Avro related
 Spark
  streaming build failure in IDEA).
 
 
  On 10/29/14 12:42 PM, Stephen Boesch wrote:
 
  I am interested specifically in how to build (and hopefully
 run/debug..)
  under Intellij.  Your posts sound like command line maven - which has
 always
  been working already.
 
  Do you have instructions for building in IJ?
 
  2014-10-28 21:38 GMT-07:00 Cheng Lian lian.cs@gmail.com:
 
  Yes, these two combinations work for me.
 
 
  On 10/29/14 12:32 PM, Zhan Zhang wrote:
 
  -Phive is to enable hive-0.13.1 and -Phive -Phive-0.12.0 is to
 enable
  hive-0.12.0. Note that the thrift-server is not supported yet in
 hive-0.13,
  but expected to go to upstream soon (Spark-3720).
 
  Thanks.
 
  Zhan Zhang
 
 
On Oct 28, 2014, at 9:09 PM, Stephen Boesch java...@gmail.com
 wrote:
 
  Thanks Patrick for the heads up.
 
  I have not been successful to discover a combination of profiles
 (i.e.
  enabling hive or hive-0.12.0 or hive-13.0) that works in Intellij
 with
  maven. Anyone who knows how to handle this - a quick note here
 would be
  appreciated.
 
 
 
  2014-10-28 20:20 GMT-07:00 Patrick Wendell pwend...@gmail.com:
 
  Hey Stephen,
 
  In some cases in the maven build we now have pluggable source
  directories based on profiles using the maven build helper plug-in.
  This is necessary to support cross building against different Hive
  versions, and there will be additional instances of this due to
  supporting scala 2.11 and 2.10.
 
  In these cases, you may need to add source locations explicitly to
  intellij if you want the entire project to compile there.
 
  Unfortunately as long as we support cross-building like this, it
 will
  be an issue. Intellij's maven support does not correctly detect our
  use of the maven-build-plugin to add source directories.
 
  We should come up with a good set of instructions on how to import
 the
  pom files + add the few extra source directories. Off hand I am not
  sure exactly what the correct sequence is.
 
  - Patrick
 
  On Tue, Oct 28, 2014 at 7:57 PM, Stephen Boesch java...@gmail.com
 
  wrote:
 
  Hi Matei,
Until my latest pull from upstream/master it had not been
 necessary
  to
  add the hive profile: is it now??
 
  I am not using sbt gen-idea. The way to open in intellij has been
 to
  Open
  the parent directory. IJ recognizes it as a maven project.
 
  There are several steps to do surgery on the yarn-parent / yarn
  projects
 
  ,
 
  then do a full rebuild.  That was working until one week ago.
  Intellij/maven is presently broken in  two ways:  this hive shim
  (which
 
  may
 
  yet hopefully be a small/simple fix - let us see) and  (2) the
  NoClassDefFoundError
  on ThreadFactoryBuilder from my prior emails -and which is quite
 a
 
  serious
 
  problem .
 
  2014-10-28 19:46 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com
 :
 
  Hi Stephen,
 
  How did you generate your Maven workspace? You need to make sure
 the
 
  Hive
 
  profile is enabled for it. For example sbt/sbt -Phive gen-idea.
 
  Matei
 
  On Oct 28, 2014, at 7

Re: best IDE for scala + spark development?

2014-10-26 Thread Stephen Boesch
Many of the Spark developers use IntelliJ. You will in any case probably
want a full IDE (either IJ or Eclipse).

2014-10-26 8:07 GMT-07:00 ll duy.huynh@gmail.com:

 i'm new to both scala and spark.  what IDE / dev environment do you find
 most
 productive for writing code in scala with spark?  is it just vim + sbt?  or
 does a full IDE like intellij works out better?  thanks!



 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/best-IDE-for-scala-spark-development-tp8965.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Re: scalastyle annoys me a little bit

2014-10-24 Thread Stephen Boesch
Sean Owen beat me to (strongly) recommending running the zinc server. Using
the -pl option is great too - but be careful to only use it when your work
is restricted to the modules in the (comma-separated) list you provide to
-pl. Also, before using -pl you should do a mvn compile package install
on all modules. Use -pl after those steps are done - then it is
very effective.
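
Concretely, the sequence looks something like this (the profile flags are
only an example - use whichever combination you normally build with; I add
-DskipTests on the first pass since its only purpose is to publish the
modules locally):

  # full build that also publishes every module to the local repo
  mvn -Pyarn -Phadoop-2.3 -DskipTests compile package install

  # afterwards, iterate on just the module you are touching
  mvn -Pyarn -Phadoop-2.3 -pl sql/core test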

2014-10-24 13:08 GMT-07:00 Sean Owen so...@cloudera.com:

 On Fri, Oct 24, 2014 at 8:59 PM, Koert Kuipers ko...@tresata.com wrote:
  mvn clean package -DskipTests takes about 30 mins for me. thats painful
  since its needed for the tests. does anyone know any tricks to speed it
 up?
  (besides getting a better laptop). does zinc help?

 Zinc helps by about 50-100%. Worthwhile for sure. brew install zinc
 and zinc -start

  mvn test runs through the projects until one fails. then it skips the
 rest!
  since its very likely that i get a failure in some subproject, this means
  its nearly impossible to do a general test run and get a good sense of
 the
  status of the project. for example:

 You can mvn test -pl [module] to test just one module.
 It will also indicate to you that after a failure you can mvn test -rf
 :[module] to continue where it left off -- you can use this to resume
 at the next module.

 Or try -Dscalatest.testFailureIgnore=true if the mvn flags
 themselves don't work, for continuing after a test failure.

  [INFO]
  
  [INFO] Reactor Summary:
  [INFO]
  [INFO] Spark Project Parent POM .. SUCCESS
 [2.199s]
  [INFO] Spark Project Core  SUCCESS
  [39:43.028s]
  [INFO] Spark Project Bagel ... SUCCESS
 [42.569s]
  [INFO] Spark Project GraphX .. SUCCESS
  [3:22.104s]
  [INFO] Spark Project Streaming ... SUCCESS
  [7:12.592s]
  [INFO] Spark Project ML Library .. SUCCESS
  [10:32.682s]
  [INFO] Spark Project Tools ... SUCCESS
 [17.070s]
  [INFO] Spark Project Catalyst  SUCCESS
  [3:03.470s]
  [INFO] Spark Project SQL . SUCCESS
  [5:23.993s]
  [INFO] Spark Project Hive  FAILURE
  [2:08.387s]
  [INFO] Spark Project REPL  SKIPPED
  [INFO] Spark Project Assembly  SKIPPED
  [INFO] Spark Project External Twitter  SKIPPED
  [INFO] Spark Project External Kafka .. SKIPPED
  [INFO] Spark Project External Flume Sink . SKIPPED
  [INFO] Spark Project External Flume .. SKIPPED
  [INFO] Spark Project External ZeroMQ . SKIPPED
  [INFO] Spark Project External MQTT ... SKIPPED
  [INFO] Spark Project Examples  SKIPPED
 
  in this case i dont care about Hive, but i would have liked to see REPL
  run, and Kafka.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Stephen Boesch
Within its compute.close method, the JdbcRDD class has this interesting
logic for closing the jdbc connection:


  try {
    if (null != conn && !stmt.isClosed()) conn.close()
    logInfo("closed connection")
  } catch {
    case e: Exception => logWarning("Exception closing connection", e)
  }

Notice that the second check is on stmt having been closed - not on the
connection.

I would wager this was not a simple oversight and that there was some
motivation for this logic - curious whether anyone would be able to shed
some light? My particular interest is that I have written custom ORMs in
JDBC since the late 90's and never did it this way.
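
For illustration, a minimal sketch of the shape I would have expected -
just an illustration, not a proposed patch; it assumes the enclosing class
mixes in Spark's Logging trait (as RDDs do), so logInfo/logWarning are
available:

  // hypothetical variant, checking each resource before closing it
  try {
    if (null != stmt && !stmt.isClosed()) stmt.close()
  } catch {
    case e: Exception => logWarning("Exception closing statement", e)
  }
  try {
    if (null != conn && !conn.isClosed()) conn.close()
    logInfo("closed connection")
  } catch {
    case e: Exception => logWarning("Exception closing connection", e)
  }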


Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Stephen Boesch
Thanks Reynold, Ted Yu did mention it offline and I have filed a JIRA
already. Another small concern: there appears to be no exception handling
from the creation of the prepared statement (line 74) through to the
executeQuery (line 86). In case of an error/exception it would seem to be
leaking connections (and statements). If that were the case then I would
include a small patch for the exception trapping in that section of code
as well. BTW I was looking at this code for another reason, not intending
to be a bother ;)




2014-08-05 13:03 GMT-07:00 Reynold Xin r...@databricks.com:

 I'm pretty sure it is an oversight. Would you like to submit a pull
 request to fix that?



 On Tue, Aug 5, 2014 at 12:14 PM, Stephen Boesch java...@gmail.com wrote:

 Within its compute.close method, the JdbcRDD class has this interesting
 logic for closing jdbc connection:


   try {
     if (null != conn && !stmt.isClosed()) conn.close()
     logInfo("closed connection")
   } catch {
     case e: Exception => logWarning("Exception closing connection", e)
   }

 Notice that the second check is on stmt  having been closed - not on the
 connection.

 I would wager this were not a simple oversight and there were some
 motivation for this logic- curious if anyone would be able to shed some
 light?   My particular interest is that I have written custom ORM's in
 jdbc
 since late 90's  and never did it this way.





Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Stephen Boesch
Hi, yes, that callback takes care of it. Thanks!


2014-08-05 13:58 GMT-07:00 Cody Koeninger c...@koeninger.org:

 The stmt.isClosed just looks like stupidity on my part, no secret
 motivation :)  Thanks for noticing it.

 As for the leaking in the case of malformed statements, isn't that
 addressed by

 context.addOnCompleteCallback { () => closeIfNeeded() }

 or am I misunderstanding?


 On Tue, Aug 5, 2014 at 3:15 PM, Reynold Xin r...@databricks.com wrote:

 Thanks. Those are definitely great problems to fix!



 On Tue, Aug 5, 2014 at 1:11 PM, Stephen Boesch java...@gmail.com wrote:

  Thanks Reynold, Ted Yu did mention offline and I put in a jira already.
  Another small concern: there appears to be no exception handling from
 the
  creation of the prepared statement (line 74) through to the executeQuery
  (line 86).   In case of error/exception it would seem to be leaking
  connections (/statements).  If that were the case then I would include a
  small patch for the exception trapping in that section of code as well.
   BTW I was looking at this code for another reason, not intending to be
 a
  bother ;)
 
 
 
 
  2014-08-05 13:03 GMT-07:00 Reynold Xin r...@databricks.com:
 
  I'm pretty sure it is an oversight. Would you like to submit a pull
  request to fix that?
 
 
 
  On Tue, Aug 5, 2014 at 12:14 PM, Stephen Boesch java...@gmail.com
  wrote:
 
  Within its compute.close method, the JdbcRDD class has this
 interesting
  logic for closing jdbc connection:
 
 
    try {
      if (null != conn && !stmt.isClosed()) conn.close()
      logInfo("closed connection")
    } catch {
      case e: Exception => logWarning("Exception closing connection", e)
    }
 
  Notice that the second check is on stmt  having been closed - not on
 the
  connection.
 
  I would wager this were not a simple oversight and there were some
  motivation for this logic- curious if anyone would be able to shed
 some
  light?   My particular interest is that I have written custom ORM's in
  jdbc
  since late 90's  and never did it this way.
 
 
 
 





Re: Tiny curiosity question on closing the jdbc connection

2014-08-05 Thread Stephen Boesch
The existing callback does take care of it: within the DAGScheduler there
is a finally block to ensure the callbacks are executed.

  try {
val result = job.func(taskContext, rdd.iterator(split, taskContext))
job.listener.taskSucceeded(0, result)
  } finally {
taskContext.executeOnCompleteCallbacks()
  }

So I have removed that exception handling code from the  PR and updated the
JIRA.
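
For anyone following along, the guarantee boils down to the pattern below
(a simplified, self-contained sketch; SimpleTaskContext and runTask are
illustrative names only, not Spark's actual classes):

  // register-then-guaranteed-invoke: cleanup callbacks always fire,
  // even if the task body throws mid-query
  class SimpleTaskContext {
    private var callbacks = List.empty[() => Unit]
    def addOnCompleteCallback(f: () => Unit): Unit = { callbacks ::= f }
    def executeOnCompleteCallbacks(): Unit = callbacks.foreach(_.apply())
  }

  def runTask[T](ctx: SimpleTaskContext)(body: => T): T = {
    try {
      body                               // the task may throw at any point
    } finally {
      ctx.executeOnCompleteCallbacks()   // closeIfNeeded() still runs here
    }
  }

Registering the JDBC close as one of these callbacks is what lets the
compute iterator skip its own try/catch around statement creation.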


2014-08-05 14:01 GMT-07:00 Reynold Xin r...@databricks.com:

 Yes it is. I actually commented on it:
 https://github.com/apache/spark/pull/1792/files#r15840899



 On Tue, Aug 5, 2014 at 1:58 PM, Cody Koeninger c...@koeninger.org wrote:

 The stmt.isClosed just looks like stupidity on my part, no secret
 motivation :)  Thanks for noticing it.

 As for the leaking in the case of malformed statements, isn't that
 addressed by

 context.addOnCompleteCallback { () => closeIfNeeded() }

 or am I misunderstanding?


 On Tue, Aug 5, 2014 at 3:15 PM, Reynold Xin r...@databricks.com wrote:

  Thanks. Those are definitely great problems to fix!
 
 
 
  On Tue, Aug 5, 2014 at 1:11 PM, Stephen Boesch java...@gmail.com
 wrote:
 
   Thanks Reynold, Ted Yu did mention offline and I put in a jira
 already.
   Another small concern: there appears to be no exception handling from
 the
   creation of the prepared statement (line 74) through to the
 executeQuery
   (line 86).   In case of error/exception it would seem to be leaking
   connections (/statements).  If that were the case then I would
 include a
   small patch for the exception trapping in that section of code as
 well.
BTW I was looking at this code for another reason, not intending to
 be a
   bother ;)
  
  
  
  
   2014-08-05 13:03 GMT-07:00 Reynold Xin r...@databricks.com:
  
   I'm pretty sure it is an oversight. Would you like to submit a pull
   request to fix that?
  
  
  
   On Tue, Aug 5, 2014 at 12:14 PM, Stephen Boesch java...@gmail.com
   wrote:
  
   Within its compute.close method, the JdbcRDD class has this
 interesting
   logic for closing jdbc connection:
  
  
 try {
   if (null != conn && !stmt.isClosed()) conn.close()
   logInfo("closed connection")
 } catch {
   case e: Exception => logWarning("Exception closing connection", e)
 }
  
   Notice that the second check is on stmt  having been closed - not on
  the
   connection.
  
   I would wager this were not a simple oversight and there were some
   motivation for this logic- curious if anyone would be able to shed
 some
   light?   My particular interest is that I have written custom ORM's
 in
   jdbc
   since late 90's  and never did it this way.
  
  
  
  
 





Re: 'Proper' Build Tool

2014-07-28 Thread Stephen Boesch
Hi Steve,
  I had the opportunity to ask this question of Andrew Orr at the Summit.
He mentioned that as of 1.0 the recommended build tool is Maven; sbt is,
however, still supported. You will notice that the dependencies are now
handled entirely within the Maven pom.xml: the SparkBuild.scala / sbt build
reads the dependencies from the pom.xml.

Andrew further suggested looking at make-distribution.sh to see the
recommended way to create builds. Using mvn on the command line is fine,
but the aforementioned script provides a framework/guideline to set
things up properly.
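
For example, the plain mvn route is just:

  mvn -Pyarn -Phadoop-2.3 -Phive -DskipTests clean package

while make-distribution.sh wraps an equivalent invocation and additionally
lays out a runnable distribution. The exact flags the script accepts have
been changing between releases, so check the script itself rather than
older docs.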




2014-07-28 13:06 GMT-07:00 Steve Nunez snu...@hortonworks.com:

 Gents,

 It seems that until recently, building via sbt was a documented process in
 the 0.9 overview:

 http://spark.apache.org/docs/0.9.0/

 The section on building mentions using sbt/sbt assembly. However in the
 latest overview:

 http://spark.apache.org/docs/latest/index.html

 There's no mention of building with sbt.

 What's the recommended way to build? What are most people using in their
 daily workflow?

 Cheers,
 - SteveN








No such file or directory errors running tests

2014-07-27 Thread Stephen Boesch
I have pulled the latest from GitHub this afternoon. There are many, many
errors:

source_home/assembly/target/scala-2.10: No such file or directory

This causes many tests to fail.

Here is the command line I am running

mvn -Pyarn -Phadoop-2.3 -Phive package test


Re: No such file or directory errors running tests

2014-07-27 Thread Stephen Boesch
Hi Reynold,
  thanks for responding here. Yes, I had looked at the Building with Maven
page in the past. I had not noticed that the package step must happen
*before* the test. I had assumed it was a corequisite - as seen in my
command line.

So the following sequence appears to work fine (so far so good - well past
when the prior attempts failed):


 mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
mvn -Pyarn -Phadoop-2.3 -Phive test

As for documentation, yes, adding another sentence to that same Building
with Maven page would likely be helpful to future generations.


2014-07-27 19:10 GMT-07:00 Reynold Xin r...@databricks.com:

 To run through all the tests you'd need to create the assembly jar first.


 I've seen this asked a few times. Maybe we should make it more obvious.



 http://spark.apache.org/docs/latest/building-with-maven.html

 Spark Tests in Maven

 Tests are run by default via the ScalaTest Maven plugin
 http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin.
 Some of them require Spark to be packaged first, so always run mvn package
  with -DskipTests the first time. You can then run the tests with mvn
 -Dhadoop.version=... test.

 The ScalaTest plugin also supports running only a specific test suite as
 follows:

 mvn -Dhadoop.version=... -DwildcardSuites=org.apache.spark.repl.ReplSuite
 test





 On Sun, Jul 27, 2014 at 7:07 PM, Stephen Boesch java...@gmail.com wrote:

  I have pulled latest from github this afternoon.   There are many many
  errors:
 
  source_home/assembly/target/scala-2.10: No such file or directory
 
  This causes many tests to fail.
 
  Here is the command line I am running
 
  mvn -Pyarn -Phadoop-2.3 -Phive package test
 



Re: No such file or directory errors running tests

2014-07-27 Thread Stephen Boesch
 OK, I'll do it after confirming all the tests run.


2014-07-27 19:36 GMT-07:00 Reynold Xin r...@databricks.com:

 Would you like to submit a pull request? All doc source code are in the
 docs folder. Cheers.



 On Sun, Jul 27, 2014 at 7:35 PM, Stephen Boesch java...@gmail.com wrote:

  i Reynold,
thanks for responding here. Yes I had looked at the building with maven
  page in the past.  I have not noticed  that the package step must
 happen
  *before *the test.  I had assumed it were a corequisite -as seen in my
  command line.
 
  So the following sequence appears to work fine (so far so good - well
 past
  when the prior attempts failed):
 
 
   mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
  mvn -Pyarn -Phadoop-2.3 -Phive test
 
  AFA documentation,  yes adding another sentence to that same Building
 with
  Maven page would likely be helpful to future generations.
 
 
  2014-07-27 19:10 GMT-07:00 Reynold Xin r...@databricks.com:
 
   To run through all the tests you'd need to create the assembly jar
 first.
  
  
   I've seen this asked a few times. Maybe we should make it more obvious.
  
  
  
   http://spark.apache.org/docs/latest/building-with-maven.html
  
   Spark Tests in Maven
  
   Tests are run by default via the ScalaTest Maven plugin
   http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin
 .
   Some of the require Spark to be packaged first, so always run mvn
 package
with -DskipTests the first time. You can then run the tests with mvn
   -Dhadoop.version=... test.
  
   The ScalaTest plugin also supports running only a specific test suite
 as
   follows:
  
   mvn -Dhadoop.version=...
 -DwildcardSuites=org.apache.spark.repl.ReplSuite
   test
  
  
  
  
  
   On Sun, Jul 27, 2014 at 7:07 PM, Stephen Boesch java...@gmail.com
  wrote:
  
I have pulled latest from github this afternoon.   There are many
 many
errors:
   
source_home/assembly/target/scala-2.10: No such file or directory
   
This causes many tests to fail.
   
Here is the command line I am running
   
mvn -Pyarn -Phadoop-2.3 -Phive package test
   
  
 



Re: No such file or directory errors running tests

2014-07-27 Thread Stephen Boesch
Hi Steve,
  I am running on the cdh5.0.0 VM (which is CentOS 6.5). Given the
difference in OS and Hadoop distro between us, my results are not likely to
be of direct help to you. But in any case I will let you know (likely
offline).


2014-07-27 20:02 GMT-07:00 Steve Nunez snu...@hortonworks.com:

 Whilst we're on this topic, I'd be interested to see if you get hive
 failures. I'm trying to build on a Mac using HDP and seem to be getting
 failures related to Parquet. I'll know for sure once I get in tomorrow and
 confirm with engineering, but this is likely because the version of Hive
 is 0.12.0, and Parquet is only supported in Hive 0.13 (HDP is 0.13).

 Any idea on what it would take to bump the Hive version up to the latest?

 Regards,
 - SteveN



 On 7/27/14, 19:39, Stephen Boesch java...@gmail.com wrote:

  OK i'll do it after confirming all the tests run
 
 
 2014-07-27 19:36 GMT-07:00 Reynold Xin r...@databricks.com:
 
  Would you like to submit a pull request? All doc source code are in the
  docs folder. Cheers.
 
 
 
  On Sun, Jul 27, 2014 at 7:35 PM, Stephen Boesch java...@gmail.com
 wrote:
 
   i Reynold,
 thanks for responding here. Yes I had looked at the building with
 maven
   page in the past.  I have not noticed  that the package step must
  happen
   *before *the test.  I had assumed it were a corequisite -as seen in my
   command line.
  
   So the following sequence appears to work fine (so far so good - well
  past
   when the prior attempts failed):
  
  
mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
   mvn -Pyarn -Phadoop-2.3 -Phive test
  
   AFA documentation,  yes adding another sentence to that same Building
  with
   Maven page would likely be helpful to future generations.
  
  
   2014-07-27 19:10 GMT-07:00 Reynold Xin r...@databricks.com:
  
To run through all the tests you'd need to create the assembly jar
  first.
   
   
I've seen this asked a few times. Maybe we should make it more
 obvious.
   
   
   
http://spark.apache.org/docs/latest/building-with-maven.html
   
Spark Tests in Maven
   
Tests are run by default via the ScalaTest Maven plugin
   
 http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin
  .
Some of the require Spark to be packaged first, so always run mvn
  package
 with -DskipTests the first time. You can then run the tests with
 mvn
-Dhadoop.version=... test.
   
The ScalaTest plugin also supports running only a specific test
 suite
  as
follows:
   
mvn -Dhadoop.version=...
  -DwildcardSuites=org.apache.spark.repl.ReplSuite
test
   
   
   
   
   
On Sun, Jul 27, 2014 at 7:07 PM, Stephen Boesch java...@gmail.com
   wrote:
   
 I have pulled latest from github this afternoon.   There are many
  many
 errors:

 source_home/assembly/target/scala-2.10: No such file or
 directory

 This causes many tests to fail.

 Here is the command line I am running

 mvn -Pyarn -Phadoop-2.3 -Phive package test

   
  
 






Re: SQLQuerySuite error

2014-07-24 Thread Stephen Boesch
OK, I did find my error. The missing step:

  mvn install

I should have republished (mvn install) all of the other modules.

The mvn -pl will rely on the locally published modules, and so the latest
code that I had git pulled was not being used (except for the sql/core
module code).

The tests are passing after having properly performed the mvn install
before running with the mvn -pl sql/core.
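
For the record, roughly the sequence that works for me now (I add
-DskipTests on the install pass since its only purpose is to publish the
modules locally):

  # rebuild and locally publish all modules so -pl sees the fresh code
  mvn -Pyarn -Pcdh5 -DskipTests clean install

  # then run only the one suite in sql/core
  mvn -Pyarn -Pcdh5 -pl sql/core -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite test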




2014-07-24 12:04 GMT-07:00 Stephen Boesch java...@gmail.com:


 Are other developers seeing the following error for the recently added
 substr() method?  If not, any ideas why the following invocation of tests
 would be failing for me - i.e. how the given invocation would need to be
 tweaked?

 mvn -Pyarn -Pcdh5 test  -pl sql/core
 -DwildcardSuites=org.apache.spark.sql.SQLQuerySuite

 (note cdh5 is a custom profile for cdh5.0.0 but should not be affecting
 these results)

 Only the test(SPARK-2407 Added Parser of SQL SUBSTR()) fails: all of the
 other 33 tests pass.

 SQLQuerySuite:
 - SPARK-2041 column name equals tablename
 - SPARK-2407 Added Parser of SQL SUBSTR() *** FAILED ***
   Exception thrown while executing query:
   == Logical Plan ==
   java.lang.UnsupportedOperationException
   == Optimized Logical Plan ==
   java.lang.UnsupportedOperationException
   == Physical Plan ==
   java.lang.UnsupportedOperationException
   == Exception ==
   java.lang.UnsupportedOperationException
   java.lang.UnsupportedOperationException
   at
 org.apache.spark.sql.catalyst.analysis.EmptyFunctionRegistry$.lookupFunction(FunctionRegistry.scala:33)
   at
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:131)
   at
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:129)
   at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
   at
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
   at
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
   at org.apache.spark.sql.catalyst.plans.QueryPlan.org
 $apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:52)
   at
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:66)
   at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at



Current way to include hive in a build

2014-07-17 Thread Stephen Boesch
Having looked at make-distribution.sh on trunk, the --with-hive and
--with-yarn flags are now deprecated.

Here is the way I have built it:

Added to pom.xml:

    <profile>
      <id>cdh5</id>
      <activation>
        <activeByDefault>false</activeByDefault>
      </activation>
      <properties>
        <hadoop.version>2.3.0-cdh5.0.0</hadoop.version>
        <yarn.version>2.3.0-cdh5.0.0</yarn.version>
        <hbase.version>0.96.1.1-cdh5.0.0</hbase.version>
        <zookeeper.version>3.4.5-cdh5.0.0</zookeeper.version>
      </properties>
    </profile>

*mvn -Pyarn -Pcdh5 -Phive -Dhadoop.version=2.3.0-cdh5.0.1
-Dyarn.version=2.3.0-cdh5.0.0 -DskipTests clean package*


[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [3.165s]
[INFO] Spark Project Core  SUCCESS
[2:39.504s]
[INFO] Spark Project Bagel ... SUCCESS [7.596s]
[INFO] Spark Project GraphX .. SUCCESS [22.027s]
[INFO] Spark Project ML Library .. SUCCESS [36.284s]
[INFO] Spark Project Streaming ... SUCCESS [24.309s]
[INFO] Spark Project Tools ... SUCCESS [3.147s]
[INFO] Spark Project Catalyst  SUCCESS [20.148s]
[INFO] Spark Project SQL . SUCCESS [18.560s]
*[INFO] Spark Project Hive  FAILURE
[33.962s]*

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-dependency-plugin:2.4:copy-dependencies
(copy-dependencies) on project spark-hive_2.10: Execution copy-dependencies
of goal
org.apache.maven.plugins:maven-dependency-plugin:2.4:copy-dependencies
failed: Plugin org.apache.maven.plugins:maven-dependency-plugin:2.4 or one
of its dependencies could not be resolved: Could not find artifact
commons-logging:commons-logging:jar:1.0.4 - [Help 1]

Anyone who is presently building with -Phive and has a suggestion for this?