Re: matrix computation in spark

2014-11-17 Thread Reza Zadeh
Hi Yuxi,

We are integrating the ml-matrix from the AMPlab repo into MLlib, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-3434

We already have matrix multiply but are still missing LU decomposition. Could
you please follow that JIRA? Once the initial design is in, we can sync on
how to contribute an LU decomposition.

Let's move the discussion to the JIRA.

Thanks!

On Mon, Nov 17, 2014 at 9:49 PM, 顾荣  wrote:

> Hey Yuxi,
>
> We have also implemented a distributed matrix multiplication library at
> PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
> implemented three distributed matrix multiplication algorithms on Spark. As
> we have seen, communication-optimal does not always mean optimal overall.
> Thus, besides the CARMA matrix multiplication you mentioned, we also
> implemented Block-splitting matrix multiplication and Broadcast matrix
> multiplication. These are more efficient than CARMA matrix multiplication
> in some situations, for example when a large matrix is multiplied by a
> small matrix.
>
> Actually, we shared this work at the Spark Meetup in Beijing on October 26th
> (http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/). The
> slides can be downloaded from the archive here:
> http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd
>
> Best,
> Rong
>
>
> --
> --
> Rong Gu
> Department of Computer Science and Technology
> State Key Laboratory for Novel Software Technology
> Nanjing University
> Phone: +86 15850682791
> Email: gurongwal...@gmail.com
> Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/
>


Re: matrix computation in spark

2014-11-17 Thread liaoyuxi
Hi,
I checked the work in ml-matrix. For now, it doesn't include matrix multiply
or LU decomposition. What's your plan? Can we contribute our work for these
parts?
Also, the number of row/column blocks is decided manually; as we mentioned,
the CARMA method in the paper is communication-optimal.

From: Zongheng Yang [mailto:zonghen...@gmail.com]
Sent: November 18, 2014 11:37
To: liaoyuxi; d...@spark.incubator.apache.org
Cc: Shivaram Venkataraman
Subject: Re: matrix computation in spark

There's been some work at the AMPLab on a distributed matrix library on top of 
Spark; see here [1]. In particular, the repo contains a couple factorization 
algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi <liaoy...@huawei.com> wrote:
Hi,
Matrix computation is critical to the efficiency of algorithms such as least
squares, the Kalman filter, and so on.
For now, the mllib module offers only limited linear algebra on matrices,
especially distributed matrices.

We have been working on establishing distributed matrix computation APIs based
on the data structures in MLlib.
The main idea is to partition the matrix into sub-blocks, based on the strategy
in the following paper:
http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
In our experiments, it is communication-optimal.
But operations like factorization may not be appropriate to carry out in blocks.

Any suggestions and guidance are welcome.

Thanks,
Yuxi


Re: matrix computation in spark

2014-11-17 Thread 顾荣
Hey Yuxi,

We have also implemented a distributed matrix multiplication library at
PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
implemented three distributed matrix multiplication algorithms on Spark. As
we have seen, communication-optimal does not always mean optimal overall.
Thus, besides the CARMA matrix multiplication you mentioned, we also
implemented Block-splitting matrix multiplication and Broadcast matrix
multiplication. These are more efficient than CARMA matrix multiplication
in some situations, for example when a large matrix is multiplied by a
small matrix.

Actually, we shared this work at the Spark Meetup in Beijing on October 26th
(http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/). The
slides can be downloaded from the archive here:
http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd
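[Archive note: the broadcast strategy described above can be sketched in a few
lines of plain Scala. This is a hypothetical illustration, not the marlin or
MLlib API: the large matrix stays partitioned by rows, and the small right-hand
matrix is shipped whole to every worker (in Spark, via `sc.broadcast`), so the
large matrix is never shuffled.]

```scala
// Hypothetical sketch of the broadcast strategy (plain Scala stands in for a
// Spark RDD). The large matrix is a collection of (rowIndex, row) pairs; the
// small matrix is dense and assumed to fit in every worker's memory.
def broadcastMultiply(largeRows: Seq[(Long, Array[Double])],
                      small: Array[Array[Double]]): Seq[(Long, Array[Double])] = {
  val nCols = small(0).length
  // In Spark this map would run per partition, with `small` read from a
  // broadcast variable instead of a closure capture.
  largeRows.map { case (idx, row) =>
    val out = Array.tabulate(nCols) { j =>
      (0 until row.length).map(k => row(k) * small(k)(j)).sum
    }
    (idx, out)
  }
}
```

Because only the small matrix moves over the network, this wins over CARMA-style
schemes exactly in the "large times small" case the message mentions.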

Best,
Rong




-- 
--
Rong Gu
Department of Computer Science and Technology
State Key Laboratory for Novel Software Technology
Nanjing University
Phone: +86 15850682791
Email: gurongwal...@gmail.com
Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/


Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

I think absolute error as a splitting criterion might be feasible with the
current architecture -- the sufficient statistics we currently collect might
be able to support it. Could you let us know of scenarios where absolute
error has significantly outperformed squared error for regression trees?
Also, what is the use case that makes squared error undesirable?

For gradient boosting, you are correct. The weak hypothesis weights refer
to the tree predictions in each of the branches. We plan to explain this in
the 1.2 documentation and maybe add some more clarifications to the Javadoc.

I will try to search for JIRAs or create new ones and update this thread.

-Manish
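[Archive note: the two pieces being discussed — the gradient used to re-label
instances, and the per-leaf "weak hypothesis weight" — can be sketched for
absolute error. This is a hedged illustration, not MLlib's implementation: for
L(y, F) = |y - F|, the negative gradient is sign(y - F), and the optimal
constant prediction for a leaf is the median of the residuals in that leaf.]

```scala
// Negative gradient of absolute error |y - F| with respect to F:
// this is what each boosting iteration would fit a new tree against.
def absGradient(label: Double, prediction: Double): Double =
  math.signum(label - prediction)

// Optimal constant for a leaf under absolute error: the median of the
// residuals that fall into that leaf (for squared error it would be the mean).
def medianResidual(residuals: Seq[Double]): Double = {
  val sorted = residuals.sorted
  val n = sorted.length
  if (n % 2 == 1) sorted(n / 2)
  else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
}
```

Computing a distributed median is what makes the leaf weights harder than the
squared-error case, which needs only sums and counts.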

On Monday, November 17, 2014, Alessandro Baretta 
wrote:

> Manish,
>
> Thanks for pointing me to the relevant docs. It is unfortunate that
> absolute error is not supported yet. I can't seem to find a Jira for it.
>
> Now, here's what the comments say in the current master branch:
> /**
>  * :: Experimental ::
>  * A class that implements Stochastic Gradient Boosting
>  * for regression and binary classification problems.
>  *
>  * The implementation is based upon:
>  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
>  *
>  * Notes:
>  *  - This currently can be run with several loss functions.  However,
> only SquaredError is
>  *fully supported.  Specifically, the loss function should be used to
> compute the gradient
>  *(to re-label training instances on each iteration) and to weight
> weak hypotheses.
>  *Currently, gradients are computed correctly for the available loss
> functions,
>  *but weak hypothesis weights are not computed correctly for LogLoss
> or AbsoluteError.
>  *Running with those losses will likely behave reasonably, but lacks
> the same guarantees.
> ...
> */
>
> By the looks of it, the GradientBoosting API would support an absolute
> error type loss function to perform quantile regression, except for "weak
> hypothesis weights". Does this refer to the weights of the leaves of the
> trees?
>
> Alex
>
> On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde  wrote:
>
>> Hi Alessandro,
>>
>> MLlib v1.1 supports variance for regression and gini impurity and entropy
>> for classification.
>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>
>> If the information gain calculation can be performed by distributed
>> aggregation then it might be possible to plug it into the existing
>> implementation. We want to perform such calculations (e.g., the median) for
>> the gradient boosting models (coming up in the 1.2 release) using absolute
>> error and deviance as loss functions but I don't think anyone is planning
>> to work on it yet. :-)
>>
>> -Manish
>>
>> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>>
>>> I see that, as of v. 1.1, MLLib supports regression and classification
>>> tree
>>> models. I assume this means that it uses a squared-error loss function
>>> for
>>> the first and logistic cost function for the second. I don't see support
>>> for quantile regression via an absolute error cost function. Or am I
>>> missing something?
>>>
>>> If, as it seems, this is missing, how do you recommend to implement it?
>>>
>>> Alex
>>>
>>
>>
>


Re: matrix computation in spark

2014-11-17 Thread Zongheng Yang
There's been some work at the AMPLab on a distributed matrix library on top
of Spark; see here [1]. In particular, the repo contains a couple
factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi  wrote:

> Hi,
> Matrix computation is critical to the efficiency of algorithms such as
> least squares, the Kalman filter, and so on.
> For now, the mllib module offers only limited linear algebra on matrices,
> especially distributed matrices.
>
> We have been working on establishing distributed matrix computation APIs
> based on the data structures in MLlib.
> The main idea is to partition the matrix into sub-blocks, based on the
> strategy in the following paper:
> http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
> In our experiments, it is communication-optimal.
> But operations like factorization may not be appropriate to carry out in
> blocks.
>
> Any suggestions and guidance are welcome.
>
> Thanks,
> Yuxi
>
>


matrix computation in spark

2014-11-17 Thread liaoyuxi
Hi,
Matrix computation is critical to the efficiency of algorithms such as least
squares, the Kalman filter, and so on.
For now, the mllib module offers only limited linear algebra on matrices,
especially distributed matrices.

We have been working on establishing distributed matrix computation APIs based
on the data structures in MLlib.
The main idea is to partition the matrix into sub-blocks, based on the strategy
in the following paper:
http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
In our experiments, it is communication-optimal.
But operations like factorization may not be appropriate to carry out in blocks.

Any suggestions and guidance are welcome.

Thanks,
Yuxi
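[Archive note: the sub-block partitioning idea can be illustrated with a small,
purely local sketch. This is hypothetical code, not the proposed API: matrices
are maps from block coordinates to dense sub-matrices, and C(i, j) is the sum
over k of A(i, k) * B(k, j). In a real Spark implementation, each product term
would come from a join on the shared index k, followed by a reduceByKey on
(i, j) — which is where the communication cost the paper optimizes lives.]

```scala
// A block is a dense sub-matrix; a matrix is a map from (blockRow, blockCol)
// coordinates to blocks.
type Block = Array[Array[Double]]

// Multiply two dense sub-blocks locally.
def localMultiply(a: Block, b: Block): Block =
  Array.tabulate(a.length, b(0).length) { (i, j) =>
    (0 until b.length).map(k => a(i)(k) * b(k)(j)).sum
  }

// Add two sub-blocks of the same shape.
def localAdd(a: Block, b: Block): Block =
  Array.tabulate(a.length, a(0).length)((i, j) => a(i)(j) + b(i)(j))

// Block-splitting multiply: C(i, j) = sum over k of A(i, k) * B(k, j).
// The groupBy/reduce below plays the role of reduceByKey on (i, j).
def blockMultiply(aBlocks: Map[(Int, Int), Block],
                  bBlocks: Map[(Int, Int), Block]): Map[(Int, Int), Block] = {
  val products = for {
    ((i, k), a) <- aBlocks.toSeq
    ((k2, j), b) <- bBlocks.toSeq
    if k == k2 // join on the shared inner index
  } yield ((i, j), localMultiply(a, b))
  products.groupBy(_._1).map { case (ij, terms) =>
    ij -> terms.map(_._2).reduce(localAdd)
  }
}
```

The CARMA strategy the thread cites chooses how many blocks to cut along each
dimension to minimize this communication; factorizations like LU are harder
because the blocks are not independent.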



Using sampleByKey

2014-11-17 Thread Debasish Das
Hi,

I have an RDD whose key is a userId and value is (movieId, rating)...

I want to sample 80% of the (movieId, rating) pairs that each userId has seen
for training; the rest is for testing...

val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}

val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}

val keyedTraining = keyedRatings.sample(true, 0.8, 1L)

val keyedTest = keyedRatings.subtract(keyedTraining)

val blocks = sc.defaultParallelism

println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")

println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")

println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")

My expectation was that the println calls would report the same number of keys
for keyedRatings, keyedTraining and keyedTest, but this is not the case...

On MovieLens for example I am noticing the following:

Rating keys 3706

Training keys 3676

Test keys 3470

I also tried sampleByKey as follows:

val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}

val fractions = keyedRatings.map{x=> (x._1, 0.8)}.collect.toMap

val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)

val keyedTest = keyedRatings.subtract(keyedTraining)

Still I get the results as:

Rating keys 3706

Training keys 3682

Test keys 3459

Any idea what's wrong here...

Are my assumptions about the behavior of sample/sampleByKey on a key-value RDD
correct? If this is a bug, I can dig deeper...

Thanks.

Deb
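[Archive note: the observed behavior is expected — sample() and sampleByKey()
decide independently per element, so a key whose few values all land in the
held-out 20% vanishes from the training set entirely. A split done inside each
key's group guarantees every key keeps at least one training element. The
sketch below is hypothetical plain Scala standing in for the Spark version,
which would be a groupByKey followed by a flatMap.]

```scala
import scala.util.Random

// Split a keyed dataset so that each key keeps roughly trainFrac of its
// values in the training split, and never loses all of them.
def splitPerKey[K, V](data: Seq[(K, V)], trainFrac: Double, seed: Long)
    : (Seq[(K, V)], Seq[(K, V)]) = {
  val rng = new Random(seed)
  val grouped = data.groupBy(_._1)
  val (train, test) = grouped.values.map { group =>
    val shuffled = rng.shuffle(group)
    // keep at least one element per key in the training split
    val nTrain = math.max(1, (shuffled.size * trainFrac).toInt)
    shuffled.splitAt(nTrain)
  }.unzip
  (train.flatten.toSeq, test.flatten.toSeq)
}
```

With this split, a userId with a single rating always lands in the training
set, so the key counts of the training RDD and the full RDD match.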


Re: Quantile regression in tree models

2014-11-17 Thread Alessandro Baretta
Manish,

Thanks for pointing me to the relevant docs. It is unfortunate that
absolute error is not supported yet. I can't seem to find a Jira for it.

Now, here's what the comments say in the current master branch:
/**
 * :: Experimental ::
 * A class that implements Stochastic Gradient Boosting
 * for regression and binary classification problems.
 *
 * The implementation is based upon:
 *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
 *
 * Notes:
 *  - This currently can be run with several loss functions.  However, only
SquaredError is
 *fully supported.  Specifically, the loss function should be used to
compute the gradient
 *(to re-label training instances on each iteration) and to weight weak
hypotheses.
 *Currently, gradients are computed correctly for the available loss
functions,
 *but weak hypothesis weights are not computed correctly for LogLoss or
AbsoluteError.
 *Running with those losses will likely behave reasonably, but lacks
the same guarantees.
...
*/

By the looks of it, the GradientBoosting API would support an absolute
error type loss function to perform quantile regression, except for "weak
hypothesis weights". Does this refer to the weights of the leaves of the
trees?

Alex

On Mon, Nov 17, 2014 at 2:24 PM, Manish Amde  wrote:

> Hi Alessandro,
>
> MLlib v1.1 supports variance for regression and gini impurity and entropy
> for classification.
> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>
> If the information gain calculation can be performed by distributed
> aggregation then it might be possible to plug it into the existing
> implementation. We want to perform such calculations (e.g., the median) for
> the gradient boosting models (coming up in the 1.2 release) using absolute
> error and deviance as loss functions but I don't think anyone is planning
> to work on it yet. :-)
>
> -Manish
>
> On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta <alexbare...@gmail.com> wrote:
>
>> I see that, as of v. 1.1, MLLib supports regression and classification
>> tree
>> models. I assume this means that it uses a squared-error loss function for
>> the first and logistic cost function for the second. I don't see support
>> for quantile regression via an absolute error cost function. Or am I
>> missing something?
>>
>> If, as it seems, this is missing, how do you recommend to implement it?
>>
>> Alex
>>
>
>


[VOTE][RESULT] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Andrew Or
This is canceled in favor of RC2 with the following blockers:

https://issues.apache.org/jira/browse/SPARK-4434
https://issues.apache.org/jira/browse/SPARK-3633

The latter one involves a regression from 1.0.2 to 1.1.0, NOT from 1.1.0 to
1.1.1. For this reason, we are currently investigating this issue but may
not necessarily block on this to release 1.1.1.




Re: Quantile regression in tree models

2014-11-17 Thread Manish Amde
Hi Alessandro,

MLlib v1.1 supports variance for regression and gini impurity and entropy
for classification.
http://spark.apache.org/docs/latest/mllib-decision-tree.html

If the information gain calculation can be performed by distributed
aggregation then it might be possible to plug it into the existing
implementation. We want to perform such calculations (e.g., the median) for
the gradient boosting models (coming up in the 1.2 release) using absolute
error and deviance as loss functions but I don't think anyone is planning
to work on it yet. :-)

-Manish

On Mon, Nov 17, 2014 at 11:11 AM, Alessandro Baretta 
wrote:

> I see that, as of v. 1.1, MLLib supports regression and classification tree
> models. I assume this means that it uses a squared-error loss function for
> the first and logistic cost function for the second. I don't see support
> for quantile regression via an absolute error cost function. Or am I
> missing something?
>
> If, as it seems, this is missing, how do you recommend to implement it?
>
> Alex
>


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Patrick Wendell
Hey Kevin,

If you are upgrading from 1.0.X to 1.1.X, check out the upgrade notes
here [1] - it could be that default changes caused a regression for
your workload. Do you still see a regression if you restore the
configuration changes?

It's great to hear specifically about issues like this, so please fork
a new thread and describe your workload if you see a regression. The
main focus of a patch release vote like this is to test regressions
against the previous release on the same line (e.g. 1.1.1 vs 1.1.0)
though of course we still want to be cognizant of 1.0-to-1.1
regressions and make sure we can address them down the road.

[1] https://spark.apache.org/releases/spark-release-1-1-0.html

On Mon, Nov 17, 2014 at 2:04 PM, Kevin Markey  wrote:
> +0 (non-binding)
>
> Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn,
> plain-vanilla Hadoop 2.3.0. No regressions.
>
> However, we saw a 12% to 22% increase in run time relative to the 1.0.0
> release (no other environment or configuration changes). We would have
> recommended +1 were it not for the added latency.
>
> Not sure whether the added latency is a function of 1.0-vs-1.1 or
> 1.0-vs-1.1.1 changes, as we've never tested with 1.1.0. But I thought I'd
> share the results. (This is somewhat disappointing.)
>
> Kevin Markey
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Kevin Markey

+0 (non-binding)

Compiled Spark, recompiled and ran application with 1.1.1 RC1 with Yarn, 
plain-vanilla Hadoop 2.3.0. No regressions.


However, we saw a 12% to 22% increase in run time relative to the 1.0.0
release (no other environment or configuration changes). We would have
recommended +1 were it not for the added latency.

Not sure whether the added latency is a function of 1.0-vs-1.1 or
1.0-vs-1.1.1 changes, as we've never tested with 1.1.0. But I thought I'd
share the results. (This is somewhat disappointing.)


Kevin Markey

On 11/17/2014 11:42 AM, Debasish Das wrote:

Andrew,

I put up 1.1.1 branch and I am getting shuffle failures while doing flatMap
followed by groupBy...My cluster memory is less than the memory I need and
therefore flatMap does around 400 GB of shuffle...memory is around 120 GB...

14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID
4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4,
mapId=-1, reduceId=22)

I searched on user-list and this issue has been found over there:

http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html

I wanted to make sure whether 1.1.1 does not have the same bug...-1 from me
till we figure out the root cause...

Thanks.

Deb




Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Nicholas Chammas
The docs on using sbt are here:
https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt

They'll be published with 1.2.0 presumably.
On Mon, Nov 17, 2014 at 2:49 PM Michael Armbrust  wrote:

> >
> > * I moved from sbt to maven in June specifically due to Andrew Or's
> > describing mvn as the default build tool.  Developers should keep in mind
> > that jenkins uses mvn so we need to run mvn before submitting PR's - even
> > if sbt were used for day to day dev work
> >
>
> To be clear, I think that the PR builder actually uses sbt currently,
> but there are master builds that make sure maven doesn't break (amongst
> other things).
>
>
> > *  In addition, as Sean has alluded to, IntelliJ seems to comprehend
> > the Maven builds a bit more readily than sbt
> >
>
> Yeah, this is a very good point.  I have used `sbt/sbt gen-idea` in the
> past, but I'm currently using the Maven integration of IntelliJ since it
> seems more stable.
>
>
> > * But for command line and day to day dev purposes:  sbt sounds great to
> > use  Those sound bites you provided about exposing built-in test
> databases
> > for hive and for displaying available testcases are sweet.  Any
> > easy/convenient way to see "more of " those kinds of facilities available
> > through sbt ?
> >
>
> The Spark SQL developer readme has a little bit of this,
> but we really should have some documentation on using SBT as well.
>
>  Integrating with those systems is generally easier if you are also working
> > with Spark in Maven.  (And I wouldn't classify all of those Maven-built
> > systems as "legacy", Michael :)
>
>
> Also a good point, though I've seen some pretty clever uses of sbt's
> external project references to link spark into other projects.  I'll
> certainly admit I have a bias towards new shiny things in general though,
> so my definition of legacy is probably skewed :)
>


Re: mvn or sbt for studying and developing Spark?

2014-11-17 Thread Michael Armbrust
>
> * I moved from sbt to maven in June specifically due to Andrew Or's
> describing mvn as the default build tool.  Developers should keep in mind
> that jenkins uses mvn so we need to run mvn before submitting PR's - even
> if sbt were used for day to day dev work
>

To be clear, I think that the PR builder actually uses sbt currently,
but there are master builds that make sure maven doesn't break (amongst
other things).


> *  In addition, as Sean has alluded to, IntelliJ seems to comprehend
> the Maven builds a bit more readily than sbt
>

Yeah, this is a very good point.  I have used `sbt/sbt gen-idea` in the
past, but I'm currently using the Maven integration of IntelliJ since it
seems more stable.


> * But for command-line and day-to-day dev purposes, sbt sounds great to
> use.  Those sound bites you provided about exposing built-in test databases
> for Hive and for displaying available test cases are sweet.  Any
> easy/convenient way to see more of those kinds of facilities available
> through sbt?
>

The Spark SQL developer README has a little bit of this,
but we really should have some documentation on using sbt as well.

> Integrating with those systems is generally easier if you are also working
> with Spark in Maven.  (And I wouldn't classify all of those Maven-built
> systems as "legacy", Michael :)


Also a good point, though I've seen some pretty clever uses of sbt's
external project references to link Spark into other projects.  I'll
certainly admit I have a bias towards shiny new things in general, though,
so my definition of legacy is probably skewed :)


Quantile regression in tree models

2014-11-17 Thread Alessandro Baretta
I see that, as of v1.1, MLlib supports regression and classification tree
models. I assume this means it uses a squared-error loss function for the
former and a logistic cost function for the latter. I don't see support
for quantile regression via an absolute-error cost function. Or am I
missing something?

If, as it seems, this is missing, how would you recommend implementing it?
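For concreteness, here's a plain-Python sketch (not MLlib code) of the loss such a feature would minimize: quantile regression minimizes the pinball (tilted absolute-error) loss, which at tau = 0.5 reduces to median regression.

```python
def squared_loss(y, pred):
    # what regression trees typically minimize today (variance impurity)
    return (y - pred) ** 2

def pinball_loss(y, pred, tau):
    # tilted absolute error: tau weights under-prediction,
    # (1 - tau) weights over-prediction
    diff = y - pred
    return tau * diff if diff >= 0 else (tau - 1.0) * diff

# tau = 0.5 gives (half the) absolute error, i.e. median regression;
# tau = 0.9 instead targets the 90th percentile of the response.
```

A tree implementation would then need the node prediction to be the (weighted) quantile of the labels in the node rather than their mean.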

Alex


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Debasish Das
Andrew,

I brought up the 1.1.1 branch and I am getting shuffle failures while doing
flatMap followed by groupBy. My cluster memory is less than the memory the
job needs: the flatMap does around 400 GB of shuffle, and cluster memory is
around 120 GB.
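(Not a diagnosis of the failure below, but when shuffle size dwarfs cluster memory like this, a common mitigation is to pre-aggregate map-side with reduceByKey/aggregateByKey instead of materializing every value per key with groupBy, assuming the downstream logic can be expressed as a combine function. A pure-Python sketch, no Spark, of why the shuffled volume differs:)

```python
records = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

def group_by_key(recs):
    # groupByKey-style: every (key, value) pair crosses the shuffle,
    # and all values for a hot key land in a single reducer's memory
    out = {}
    for k, v in recs:
        out.setdefault(k, []).append(v)
    return out

def reduce_by_key(recs, combine):
    # reduceByKey-style: values are combined map-side first, so only
    # one partial result per key per map partition is shuffled
    out = {}
    for k, v in recs:
        out[k] = combine(out[k], v) if k in out else v
    return out
```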

14/11/13 23:10:49 WARN TaskSetManager: Lost task 22.1 in stage 191.0 (TID
4084, istgbd020.hadoop.istg.verizon.com): FetchFailed(null, shuffleId=4,
mapId=-1, reduceId=22)

I searched the user list and this issue has been reported there:
http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-partitionBy-FetchFailed-td14760.html

I wanted to make sure 1.1.1 does not have the same bug. -1 from me
till we figure out the root cause.

Thanks.

Deb

On Mon, Nov 17, 2014 at 10:33 AM, Andrew Or  wrote:

> This seems like a legitimate blocker. We will cut another RC to include the
> revert.
>
> 2014-11-16 17:29 GMT-08:00 Kousuke Saruta :
>
> > Now I've finished the revert for SPARK-4434 and opened a PR.
> >
> >
> > (2014/11/16 17:08), Josh Rosen wrote:
> >
> >> -1
> >>
> >> I found a potential regression in 1.1.1 related to spark-submit and
> >> cluster
> >> deploy mode: https://issues.apache.org/jira/browse/SPARK-4434
> >>
> >> I think that this is worth fixing.
> >>
> >> On Fri, Nov 14, 2014 at 7:28 PM, Cheng Lian 
> >> wrote:
> >>
> >>  +1
> >>>
> >>> Tested HiveThriftServer2 against Hive 0.12.0 on Mac OS X. Known issues
> >>> are
> >>> fixed. Hive version inspection works as expected.
> >>>
> >>>
> >>> On 11/15/14 8:25 AM, Zach Fry wrote:
> >>>
> >>>  +0
> 
>  I expect to start testing on Monday but won't have enough results to
>  change
>  my vote from +0
>  until Monday night or Tuesday morning.
> 
>  Thanks,
>  Zach
> 
> 
> 
>  --
>  View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-1-RC1-tp9311p9370.html
>  Sent from the Apache Spark Developers List mailing list archive at
>  Nabble.com.
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>  For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 
> 
> >>>
> >>>
> >>>
> >
> >
> >
>


Re: [VOTE] Release Apache Spark 1.1.1 (RC1)

2014-11-17 Thread Andrew Or
This seems like a legitimate blocker. We will cut another RC to include the
revert.



[ANNOUNCE] Spark 1.2.0 Release Preview Posted

2014-11-17 Thread Patrick Wendell
Hi All,

I've just posted a preview of the Spark 1.2.0 release for community
regression testing.

Issues reported now will get close attention, so please help us test!
You can help by running an existing Spark 1.X workload on this and
reporting any regressions. As voting begins, the bar for reported
issues to hold the release will get higher and higher, so test early!

The tag to be tested is v1.2.0-snapshot1 (commit 38c1fbd96)

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-snapshot1

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1038/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-snapshot1-docs/

== Notes ==
- Maven artifacts are published for both Scala 2.10 and 2.11. Binary
distributions are not posted for Scala 2.11 yet, but will be posted
soon.

- There are two significant config default changes that users may want
to revert when doing A/B testing against older versions:

"spark.shuffle.manager" default has changed to "sort" (was "hash")
"spark.shuffle.blockTransferService" default has changed to "netty" (was "nio")
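
To restore the 1.1 behavior for A/B testing, the old defaults can be set explicitly, e.g. in conf/spark-defaults.conf (or via --conf on spark-submit); a sketch using the property names and old values listed above:

```
spark.shuffle.manager                 hash
spark.shuffle.blockTransferService    nio
```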

- This release contains a shuffle service for YARN. The jar is
present in all Hadoop 2.X binary packages as
"lib/spark-1.2.0-yarn-shuffle.jar"

Cheers,
Patrick
