Re: get -101 error code when running select query

2014-04-23 Thread qingyang li
Thanks for sharing; my case is different from yours. I set
hive.server2.enable.doAs to false in hive-site.xml, and then
the -101 error code disappeared.



2014-04-24 9:26 GMT+08:00 Madhu :

> I have seen a similar error message when connecting to Hive through JDBC.
> This is just a guess on my part, but check your query. The error occurs if
> you have a select that includes a null literal with an alias like this:
>
> select a, b, null as c, d from foo
>
> In my case, rewriting the query to use an empty string or other literal
> instead of null worked:
>
> select a, b, '' as c, d from foo
>
> I think the problem is the lack of type information when supplying a null
> literal.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/get-101-error-code-when-running-select-query-tp6377p6382.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
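If the untyped NULL literal is indeed the cause, a third option (untested here, and purely a sketch) would be to keep the NULL but give it an explicit type with a cast, so the driver can report a concrete column type:

```sql
-- Hypothetical variant of the query above: keep the NULL but type it
-- explicitly, so the result schema has a concrete type for column c.
select a, b, cast(null as string) as c, d from foo
```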


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
The figure showing the Log-Likelihood vs Time can be found here.

https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf

Let me know if you cannot open it. Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
 wrote:
> I don't think the attachment came through in the list. Could you upload the
> results somewhere and link to them ?
>
>
> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai  wrote:
>>
>> 123 features per row, and on average, 89% are zeros.
>> On Apr 23, 2014 9:31 PM, "Evan Sparks"  wrote:
>>
>> > What is the number of non-zeros per row (and the number of features) in
>> > the sparse case? We've hit some issues with Breeze sparse support in the
>> > past, but for sufficiently sparse data it's still pretty good.
>> >
>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai  wrote:
>> > >
>> > > Hi all,
>> > >
>> > > I'm benchmarking Logistic Regression in MLlib using the newly added
>> > > optimizers, LBFGS and GD. I'm using the same dataset and the same
>> > > methodology as in this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>> > >
>> > > I want to know how Spark scales as workers are added, and how the
>> > > optimizer and input format (sparse or dense) impact performance.
>> > >
>> > > The benchmark code can be found here,
>> > https://github.com/dbtsai/spark-lbfgs-benchmark
>> > >
>> > > The first dataset I benchmarked is a9a, which is only 2.2MB. I
>> > > duplicated the dataset to 762MB so that it has 11M rows. This dataset
>> > > has 123 features, and 11% of the entries are non-zero.
>> > >
>> > > In this benchmark, all of the data is cached in memory.
>> > >
>> > > As expected, LBFGS converges faster than GD; past a certain point, no
>> > > matter how hard we push GD, it converges more and more slowly.
>> > >
>> > > However, it's surprising that the sparse format runs slower than the
>> > > dense format. The sparse format does take significantly less memory
>> > > when caching the RDD, but it is 40% slower than dense. I would expect
>> > > sparse to be faster, since when we compute x·wᵀ, x being sparse means
>> > > we only need to touch its non-zero elements. I wonder if there is
>> > > anything I'm doing wrong.
>> > >
>> > > The attachment is the benchmark result.
>> > >
>> > > Thanks.
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > ---
>> > > My Blog: https://www.dbtsai.com
>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>
>
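A toy model of the inner product makes the surprising result above easier to reason about: a sparse dot product skips the zero entries, but pays an index indirection per stored non-zero, so at roughly 11% density the bookkeeping can outweigh the skipped multiplies. Below is a minimal plain-Python sketch of the two access patterns (illustrative names only; this is not MLlib's or Breeze's actual code):

```python
import random

def dense_dot(w, x):
    # Dense layout: touch every slot, zeros included, with purely
    # sequential access.
    return sum(wi * xi for wi, xi in zip(w, x))

def sparse_dot(w, x_idx, x_val):
    # Sparse (index/value) layout: touch only the non-zeros of x, at the
    # cost of one index lookup into w per stored entry.
    return sum(w[i] * v for i, v in zip(x_idx, x_val))

random.seed(0)
n_features = 123
w = [random.random() for _ in range(n_features)]

# A row at roughly 11% density, like the duplicated a9a dataset.
x = [random.random() if random.random() < 0.11 else 0.0
     for _ in range(n_features)]
x_idx = [i for i, v in enumerate(x) if v != 0.0]
x_val = [x[i] for i in x_idx]

# Both layouts compute the same value; only the memory access pattern
# (and thus the constant factor) differs.
assert abs(dense_dot(w, x) - sparse_dot(w, x_idx, x_val)) < 1e-9
```

Which version wins depends on the density and on constants like branch prediction and cache behavior, which is consistent with the remark later in the thread that ~11% density is near the crossover point where dense tends to be faster.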


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
Not yet, since it's running on the cluster. I will run it locally with a
profiler. Thanks for the help.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 10:22 PM, David Hall  wrote:
> On Wed, Apr 23, 2014 at 10:18 PM, DB Tsai  wrote:
>>
>> P.S. It doesn't make sense to make the weight and gradient sparse unless
>> there is a strong L1 penalty.
>
>
> Sure, I was just checking the obvious things. Have you run it through a
> profiler to see where the problem is?


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
On Wed, Apr 23, 2014 at 10:18 PM, DB Tsai  wrote:

> P.S. It doesn't make sense to make the weight and gradient sparse unless
> there is a strong L1 penalty.
>

Sure, I was just checking the obvious things. Have you run it through a
profiler to see where the problem is?





Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
P.S. It doesn't make sense to make the weight and gradient sparse unless
there is a strong L1 penalty.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
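For readers wondering why a strong L1 penalty is what would justify sparse weights: L1-regularized updates are commonly implemented with a soft-thresholding (proximal) step that snaps small weights to exactly zero, so the weight vector itself becomes sparse. A minimal sketch of that standard technique in plain Python (an illustration only; MLlib's own L1 updater may differ in details):

```python
import math

def soft_threshold(w, lam):
    # Proximal step for an L1 penalty of strength lam: shrink every
    # weight toward zero by lam, and snap it to exactly zero once its
    # magnitude is at most lam. With a strong penalty, many weights end
    # up exactly zero, which is what makes a sparse weight vector
    # representation pay off.
    return [math.copysign(max(abs(wi) - lam, 0.0), wi) for wi in w]

w = soft_threshold([0.3, -0.05, 1.2, 0.0], 0.1)
# Entries with |w| <= 0.1 are now exactly 0.0; the rest shrank by 0.1.
```

With a weak (or absent) L1 penalty almost no weights are thresholded to zero, so a sparse representation of w or of the gradient buys nothing.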


On Wed, Apr 23, 2014 at 10:17 PM, DB Tsai  wrote:
> In MLlib, the weight and gradient vectors are dense; only the feature
> vectors are sparse.


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
In MLlib, the weight and gradient vectors are dense; only the feature
vectors are sparse.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
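The shape described above (sparse features, dense weight and gradient) can be sketched as follows in plain Python. This illustrates the access pattern only, and is not MLlib's actual gradient code:

```python
import math

def logistic_grad_update(w, grad, x_idx, x_val, y):
    # One example's contribution to the logistic-loss gradient.
    # w and grad are dense lists; the example x is given in sparse
    # index/value form, so we touch only its non-zero features.
    margin = sum(w[i] * v for i, v in zip(x_idx, x_val))  # sparse dot
    p = 1.0 / (1.0 + math.exp(-margin))                   # P(y = 1 | x)
    err = p - y                                           # y in {0, 1}
    for i, v in zip(x_idx, x_val):                        # sparse update
        grad[i] += err * v                                # into dense grad
    return grad

# One example with 3 non-zeros out of a nominal 123 features.
w = [0.0] * 123
grad = logistic_grad_update(w, [0.0] * 123, [2, 50, 101], [1.0, 0.5, 2.0], 1)
# With w = 0: p = 0.5, err = -0.5, so grad[2] == -0.5.
```

Note that both the dot product and the gradient accumulation only iterate over the non-zeros of x, while w and grad stay dense, so nothing forces the weight or gradient to be stored sparsely.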


On Wed, Apr 23, 2014 at 10:16 PM, David Hall  wrote:
> Was the weight vector sparse? The gradients? Or just the feature vectors?


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
Was the weight vector sparse? The gradients? Or just the feature vectors?


On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai  wrote:

> The figure showing the Log-Likelihood vs Time can be found here.
>
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>
> Let me know if you can not open it.


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
The figure showing the Log-Likelihood vs Time can be found here.

https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf

Let me know if you cannot open it.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I don't think the attachment came through in the list. Could you upload
> the results somewhere and link to them ?


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
On Wed, Apr 23, 2014 at 9:30 PM, Evan Sparks  wrote:

> What is the number of non-zeros per row (and the number of features) in the
> sparse case? We've hit some issues with Breeze sparse support in the past,
> but for sufficiently sparse data it's still pretty good.
>

Any chance you remember what the problems were? I'm sure it could be
better, but it's good to know where improvements need to happen.

-- David




Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
Any suggestions for a sparser dataset? I will test more tomorrow in the office.
On Apr 23, 2014 9:33 PM, "Evan Sparks"  wrote:

> Sorry - just saw the 11% number. That is around the spot where dense data
> is usually faster (blocking, cache coherence, etc.). Is there any chance you
> have a 1% (or so) sparse dataset to experiment with?


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread Shivaram Venkataraman
I don't think the attachment came through in the list. Could you upload the
results somewhere and link to them ?


On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai  wrote:

> 123 features per row, and on average, 89% are zeros.


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
123 features per row, and on average, 89% are zeros.
On Apr 23, 2014 9:31 PM, "Evan Sparks"  wrote:

> What is the number of non-zeros per row (and the number of features) in the
> sparse case? We've hit some issues with Breeze sparse support in the past,
> but for sufficiently sparse data it's still pretty good.


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread Evan Sparks
Sorry - just saw the 11% number. That is around the spot where dense data is 
usually faster (blocking, cache coherence, etc.). Is there any chance you have a 
1% (or so) sparse dataset to experiment with?

> On Apr 23, 2014, at 9:21 PM, DB Tsai  wrote:
> 
> Hi all,
> 
> I'm benchmarking Logistic Regression in MLlib using the newly added optimizer 
> LBFGS and GD. I'm using the same dataset and the same methodology in this 
> paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
> 
> I want to know how Spark scale while adding workers, and how optimizers and 
> input format (sparse or dense) impact performance. 
> 
> The benchmark code can be found here, 
> https://github.com/dbtsai/spark-lbfgs-benchmark
> 
> The first dataset I benchmarked is a9a which only has 2.2MB. I duplicated the 
> dataset, and made it 762MB to have 11M rows. This dataset has 123 features 
> and 11% of the data are non-zero elements. 
> 
> In this benchmark, all the dataset is cached in memory.
> 
> As we expect, LBFGS converges faster than GD, and at some point, no matter 
> how we push GD, it will converge slower and slower. 
> 
> However, it's surprising that sparse format runs slower than dense format. I 
> did see that sparse format takes significantly smaller amount of memory in 
> caching RDD, but sparse is 40% slower than dense. I think sparse should be 
> fast since when we compute x wT, since x is sparse, we can do it faster. I 
> wonder if there is anything I'm doing wrong. 
> 
> The attachment is the benchmark result.
> 
> Thanks.  
> 
> Sincerely,
> 
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai


Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread Evan Sparks
What is the number of non-zeros per row (and the number of features) in the sparse 
case? We've hit some issues with Breeze's sparse support in the past, but for 
sufficiently sparse data it's still pretty good. 

> On Apr 23, 2014, at 9:21 PM, DB Tsai  wrote:
> 
> Hi all,
> 
> I'm benchmarking Logistic Regression in MLlib using the newly added optimizer 
> LBFGS and GD. I'm using the same dataset and the same methodology in this 
> paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
> 
> I want to know how Spark scale while adding workers, and how optimizers and 
> input format (sparse or dense) impact performance. 
> 
> The benchmark code can be found here, 
> https://github.com/dbtsai/spark-lbfgs-benchmark
> 
> The first dataset I benchmarked is a9a which only has 2.2MB. I duplicated the 
> dataset, and made it 762MB to have 11M rows. This dataset has 123 features 
> and 11% of the data are non-zero elements. 
> 
> In this benchmark, all the dataset is cached in memory.
> 
> As we expect, LBFGS converges faster than GD, and at some point, no matter 
> how we push GD, it will converge slower and slower. 
> 
> However, it's surprising that sparse format runs slower than dense format. I 
> did see that sparse format takes significantly smaller amount of memory in 
> caching RDD, but sparse is 40% slower than dense. I think sparse should be 
> fast since when we compute x wT, since x is sparse, we can do it faster. I 
> wonder if there is anything I'm doing wrong. 
> 
> The attachment is the benchmark result.
> 
> Thanks.  
> 
> Sincerely,
> 
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai


MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread DB Tsai
Hi all,

I'm benchmarking Logistic Regression in MLlib using the newly added
optimizer LBFGS and GD. I'm using the same dataset and the same methodology
in this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf

I want to know how Spark scales as workers are added, and how the optimizer
and input format (sparse or dense) impact performance.

The benchmark code can be found here,
https://github.com/dbtsai/spark-lbfgs-benchmark

The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated
the dataset to make it 762MB with 11M rows. This dataset has 123
features, and 11% of the data are non-zero elements.

In this benchmark, the entire dataset is cached in memory.

As expected, LBFGS converges faster than GD, and at some point, no matter
how we push GD, it converges more and more slowly.

However, it's surprising that the sparse format runs slower than the dense
format. I did see that the sparse format takes a significantly smaller amount
of memory when caching the RDD, but sparse is 40% slower than dense. I think
sparse should be faster: when we compute the dot product w^T x, since x is
sparse, we can skip the zero entries. I wonder if there is anything I'm
doing wrong.
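To illustrate the intuition, here is a toy Python sketch (illustrative only, not the MLlib/Breeze code; the data and names are made up). A sparse dot product only touches the non-zero entries of x:

```python
def dense_dot(w, x):
    # Touches every element of x, zeros included.
    return sum(wi * xi for wi, xi in zip(w, x))

def sparse_dot(w, indices, values):
    # Touches only the non-zero entries of x, given as (indices, values).
    return sum(w[i] * v for i, v in zip(indices, values))

w = [0.5, -1.0, 2.0, 0.25]
x = [0.0, 3.0, 0.0, 4.0]          # dense form: 50% zeros
idx, vals = [1, 3], [3.0, 4.0]    # equivalent sparse form

assert dense_dot(w, x) == sparse_dot(w, idx, vals)
```

Of course, at ~11% density, the dense path's sequential, cache-friendly access pattern can still win, which may explain the result.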

The attachment is the benchmark result.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


Fw: Is there any way to make a quick test on some pre-commit code?

2014-04-23 Thread Nan Zhu
I was just asked the same question by others.

I think Reynold gave a pretty helpful tip on this.

Shall we put this on the Contribute-to-Spark wiki?

--  
Nan Zhu


Forwarded message:

> From: Reynold Xin 
> Reply To: d...@spark.incubator.apache.org
> To: d...@spark.incubator.apache.org 
> Date: Thursday, February 6, 2014 at 7:50:57 PM
> Subject: Re: Is there any way to make a quick test on some pre-commit code?
>  
> You can do
>  
> sbt/sbt assemble-deps
>  
>  
> and then just run
>  
> sbt/sbt package
>  
> each time.
>  
>  
> You can even do
>  
> sbt/sbt ~package
>  
> for automatic incremental compilation.
>  
>  
>  
> On Thu, Feb 6, 2014 at 4:46 PM, Nan Zhu  (mailto:zhunanmcg...@gmail.com)> wrote:
>  
> > Hi, all
> >  
> > Is it always necessary to run sbt assembly when you want to test some code?
> >  
> > Sometimes you just repeatedly change one or two lines for a failed test
> > case, and it is really time-consuming to run sbt assembly every time.
> >  
> > Is there any faster way?
> >  
> > Best,
> >  
> > --
> > Nan Zhu
> >  
>  
>  
>  
>  




Re: [jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-23 Thread Nishkam Ravi
It would probably be best to retain support for SPARK_JAVA_OPTS in
ClientBase though..for developers that may have been using it.


On Wed, Apr 23, 2014 at 6:26 PM, Nishkam Ravi  wrote:

> Bit of a race condition here it seems. Patrick made a few changes
> yesterday around the same time as I did (in ClientBase.scala):
>
> > for ((k, v) <- sys.props.filterKeys(_.startsWith("spark"))) {
> >   JAVA_OPTS += "-D" + k + "=" + "\\\"" + v + "\\\""
> > }
>
> This would allow JAVA_OPTS to be passed on the command line to the
> ApplicationMaster, and accomplishes the same things as creation of a new
> command line flag --spark-java-opts.
>
>
> Mridul, the use of SPARK_JAVA_OPTS has been intentionally suppressed.
>
>
> On Wed, Apr 23, 2014 at 10:54 AM, Mridul Muralidharan wrote:
>
>> Sorry, I misread - I meant SPARK_JAVA_OPTS - not JAVA_OPTS.
>> See here : https://issues.apache.org/jira/browse/SPARK-1588
>>
>> Regards,
>> Mridul
>>
>> On Wed, Apr 23, 2014 at 6:37 PM, Mridul Muralidharan 
>> wrote:
>> > This breaks all existing jobs which are not using spark-submit.
>> > The consensus was not to break compatibility unless there was an
>> overriding
>> > reason to do so
>> >
>> > On Apr 23, 2014 6:32 PM, "Thomas Graves (JIRA)" 
>> wrote:
>> >>
>> >>
>> >> [
>> >>
>> https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978164#comment-13978164
>> >> ]
>> >>
>> >> Thomas Graves commented on SPARK-1576:
>> >> --
>> >>
>> >> Is this meant for the driver or the executors?  The spark-submit script
>> >> has a command line option for the driver:  --driver-java-options.
>> >> I believe the intent of https://github.com/apache/spark/pull/299 was
>> to
>> >> not expose SPARK_JAVA_OPTS to the user anymore.
>> >>
>> >> > Passing of JAVA_OPTS to YARN on command line
>> >> > 
>> >> >
>> >> > Key: SPARK-1576
>> >> > URL:
>> https://issues.apache.org/jira/browse/SPARK-1576
>> >> > Project: Spark
>> >> >  Issue Type: Improvement
>> >> >Affects Versions: 0.9.0, 1.0.0, 0.9.1
>> >> >Reporter: Nishkam Ravi
>> >> > Fix For: 0.9.0, 1.0.0, 0.9.1
>> >> >
>> >> > Attachments: SPARK-1576.patch
>> >> >
>> >> >
>> >> > JAVA_OPTS can be passed by using either env variables (i.e.,
>> >> > SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change).
>> It would
>> >> > be good to allow the user to pass them on command line as well to
>> restrict
>> >> > scope to single application invocation.
>> >>
>> >>
>> >>
>> >> --
>> >> This message was sent by Atlassian JIRA
>> >> (v6.2#6252)
>>
>
>


Re: [jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-23 Thread Nishkam Ravi
Bit of a race condition here it seems. Patrick made a few changes yesterday
around the same time as I did (in ClientBase.scala):

for ((k, v) <- sys.props.filterKeys(_.startsWith("spark"))) {
  JAVA_OPTS += "-D" + k + "=" + "\\\"" + v + "\\\""
}

This would allow JAVA_OPTS to be passed on the command line to the
ApplicationMaster, and accomplishes the same thing as creating a new
command-line flag --spark-java-opts.
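In Python terms, the filtering above amounts to roughly this (a sketch of the idea only; the property names are made up, and the real quoting/escaping in ClientBase is more involved):

```python
# Collect "spark"-prefixed system properties and turn them into -D flags
# for the child JVM's command line.
props = {
    "spark.executor.memory": "2g",   # hypothetical property
    "java.version": "1.7",           # filtered out: no "spark" prefix
}

java_opts = [
    '-D%s="%s"' % (k, v)
    for k, v in sorted(props.items())
    if k.startswith("spark")
]
```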


Mridul, the use of SPARK_JAVA_OPTS has been intentionally suppressed.


On Wed, Apr 23, 2014 at 10:54 AM, Mridul Muralidharan wrote:

> Sorry, I misread - I meant SPARK_JAVA_OPTS - not JAVA_OPTS.
> See here : https://issues.apache.org/jira/browse/SPARK-1588
>
> Regards,
> Mridul
>
> On Wed, Apr 23, 2014 at 6:37 PM, Mridul Muralidharan 
> wrote:
> > This breaks all existing jobs which are not using spark-submit.
> > The consensus was not to break compatibility unless there was an
> overriding
> > reason to do so
> >
> > On Apr 23, 2014 6:32 PM, "Thomas Graves (JIRA)"  wrote:
> >>
> >>
> >> [
> >>
> https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978164#comment-13978164
> >> ]
> >>
> >> Thomas Graves commented on SPARK-1576:
> >> --
> >>
> >> Is this meant for the driver or the executors?  The spark-submit script
> >> has a command line option for the driver:  --driver-java-options.
> >> I believe the intent of https://github.com/apache/spark/pull/299 was to
> >> not expose SPARK_JAVA_OPTS to the user anymore.
> >>
> >> > Passing of JAVA_OPTS to YARN on command line
> >> > 
> >> >
> >> > Key: SPARK-1576
> >> > URL: https://issues.apache.org/jira/browse/SPARK-1576
> >> > Project: Spark
> >> >  Issue Type: Improvement
> >> >Affects Versions: 0.9.0, 1.0.0, 0.9.1
> >> >Reporter: Nishkam Ravi
> >> > Fix For: 0.9.0, 1.0.0, 0.9.1
> >> >
> >> > Attachments: SPARK-1576.patch
> >> >
> >> >
> >> > JAVA_OPTS can be passed by using either env variables (i.e.,
> >> > SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change).
> It would
> >> > be good to allow the user to pass them on command line as well to
> restrict
> >> > scope to single application invocation.
> >>
> >>
> >>
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.2#6252)
>


Re: get -101 error code when running select query

2014-04-23 Thread Madhu
I have seen a similar error message when connecting to Hive through JDBC.
This is just a guess on my part, but check your query. The error occurs if
you have a select that includes a null literal with an alias like this:

select a, b, null as c, d from foo

In my case, rewriting the query to use an empty string or other literal
instead of null worked:

select a, b, '' as c, d from foo

I think the problem is the lack of type information when supplying a null
literal.
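If the column really needs to stay null, a typed null might also work (untested on my side; this uses HiveQL's CAST syntax with the same hypothetical table foo):

```sql
select a, b, cast(null as string) as c, d from foo
```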



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/get-101-error-code-when-running-select-query-tp6377p6382.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.


Re: ArrayIndexOutOfBoundsException in ALS.implicit

2014-04-23 Thread Xiangrui Meng
Hi bearrito, this issue was fixed by Tor in
https://github.com/apache/spark/pull/407. You can either try the
master branch or wait for the 1.0 release. -Xiangrui

On Fri, Mar 28, 2014 at 12:19 AM, Xiangrui Meng  wrote:
> Hi bearrito,
>
> This is a known issue
> (https://spark-project.atlassian.net/browse/SPARK-1281) and it should
> be easy to fix by switching to a hash partitioner.
>
> CC'ed dev list in case someone volunteers to work on it.
>
> Best,
> Xiangrui
>
> On Thu, Mar 27, 2014 at 8:38 PM, bearrito  
> wrote:
>> Usage of negative product id's causes the above exception.
>>
>> The cause is the use of the product ids as a mechanism to index into
>> the in- and out-block structures.
>>
>> Specifically on 9.0 it occurs at
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$makeInLinkBlock$2.apply(ALS.scala:262)
>>
>> It seems reasonable to expect that product ids are positive, if a bit
>> opinionated.  I ran across this because the hash function I was using on my
>> product ids includes negatives in its range.
>>
>>
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/ArrayIndexOutOfBoundsException-in-ALS-implicit-tp3400.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
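The mechanics of the failure can be sketched in Python (illustrative only, not the actual ALS code). On the JVM, % truncates toward zero, so a negative product id modulo the block count yields a negative block index, which then surfaces as an ArrayIndexOutOfBoundsException; a hash-partitioner-style non-negative mod avoids it:

```python
import math

def jvm_mod(a, b):
    # Emulate the JVM's remainder semantics (truncate toward zero),
    # which differ from Python's % for negative operands.
    return a - math.trunc(a / b) * b

def non_negative_mod(a, b):
    # The usual fix: shift negative remainders back into [0, b).
    r = jvm_mod(a, b)
    return r + b if r < 0 else r

num_blocks = 4
assert jvm_mod(-5, num_blocks) == -1          # negative index on the JVM
assert non_negative_mod(-5, num_blocks) == 3  # safe block index
```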


Re: [jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-23 Thread Mridul Muralidharan
Sorry, I misread - I meant SPARK_JAVA_OPTS - not JAVA_OPTS.
See here : https://issues.apache.org/jira/browse/SPARK-1588

Regards,
Mridul

On Wed, Apr 23, 2014 at 6:37 PM, Mridul Muralidharan  wrote:
> This breaks all existing jobs which are not using spark-submit.
> The consensus was not to break compatibility unless there was an overriding
> reason to do so
>
> On Apr 23, 2014 6:32 PM, "Thomas Graves (JIRA)"  wrote:
>>
>>
>> [
>> https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978164#comment-13978164
>> ]
>>
>> Thomas Graves commented on SPARK-1576:
>> --
>>
>> Is this meant for the driver or the executors?  The spark-submit script
>> has a command line option for the driver:  --driver-java-options.
>> I believe the intent of https://github.com/apache/spark/pull/299 was to
>> not expose SPARK_JAVA_OPTS to the user anymore.
>>
>> > Passing of JAVA_OPTS to YARN on command line
>> > 
>> >
>> > Key: SPARK-1576
>> > URL: https://issues.apache.org/jira/browse/SPARK-1576
>> > Project: Spark
>> >  Issue Type: Improvement
>> >Affects Versions: 0.9.0, 1.0.0, 0.9.1
>> >Reporter: Nishkam Ravi
>> > Fix For: 0.9.0, 1.0.0, 0.9.1
>> >
>> > Attachments: SPARK-1576.patch
>> >
>> >
>> > JAVA_OPTS can be passed by using either env variables (i.e.,
>> > SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It 
>> > would
>> > be good to allow the user to pass them on command line as well to restrict
>> > scope to single application invocation.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)


Re: [jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-23 Thread Tom Graves
Can you be more specific?  What breaks existing jobs?  If you are referring to 
my comment, SPARK_JAVA_OPTS still works, but I think the intent is to move away 
from it.

Tom
On Wednesday, April 23, 2014 8:07 AM, Mridul Muralidharan  
wrote:
 
This breaks all existing jobs which are not using spark-submit.
The consensus was not to break compatibility unless there was an overriding
reason to do so

On Apr 23, 2014 6:32 PM, "Thomas Graves (JIRA)"  wrote:

>
>     [
> https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978164#comment-13978164]
>
> Thomas Graves commented on SPARK-1576:
> --
>
> Is this meant for the driver or the executors?  The spark-submit script
> has a command line option for the driver:  --driver-java-options.
> I believe the intent of https://github.com/apache/spark/pull/299 was to
> not expose SPARK_JAVA_OPTS to the user anymore.
>
> > Passing of JAVA_OPTS to YARN on command line
> > 
> >
> >                 Key: SPARK-1576
> >                 URL: https://issues.apache.org/jira/browse/SPARK-1576
> >             Project: Spark
> >          Issue Type: Improvement
> >    Affects Versions: 0.9.0, 1.0.0, 0.9.1
> >            Reporter: Nishkam Ravi
> >             Fix For: 0.9.0, 1.0.0, 0.9.1
> >
> >         Attachments: SPARK-1576.patch
> >
> >
> > JAVA_OPTS can be passed by using either env variables (i.e.,
> SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It
> would be good to allow the user to pass them on command line as well to
> restrict scope to single application invocation.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>

Re: [jira] [Commented] (SPARK-1576) Passing of JAVA_OPTS to YARN on command line

2014-04-23 Thread Mridul Muralidharan
This breaks all existing jobs which are not using spark-submit.
The consensus was not to break compatibility unless there was an overriding
reason to do so.
On Apr 23, 2014 6:32 PM, "Thomas Graves (JIRA)"  wrote:

>
> [
> https://issues.apache.org/jira/browse/SPARK-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978164#comment-13978164]
>
> Thomas Graves commented on SPARK-1576:
> --
>
> Is this meant for the driver or the executors?  The spark-submit script
> has a command line option for the driver:  --driver-java-options.
> I believe the intent of https://github.com/apache/spark/pull/299 was to
> not expose SPARK_JAVA_OPTS to the user anymore.
>
> > Passing of JAVA_OPTS to YARN on command line
> > 
> >
> > Key: SPARK-1576
> > URL: https://issues.apache.org/jira/browse/SPARK-1576
> > Project: Spark
> >  Issue Type: Improvement
> >Affects Versions: 0.9.0, 1.0.0, 0.9.1
> >Reporter: Nishkam Ravi
> > Fix For: 0.9.0, 1.0.0, 0.9.1
> >
> > Attachments: SPARK-1576.patch
> >
> >
> > JAVA_OPTS can be passed by using either env variables (i.e.,
> SPARK_JAVA_OPTS) or as config vars (after Patrick's recent change). It
> would be good to allow the user to pass them on command line as well to
> restrict scope to single application invocation.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
>


get -101 error code when running select query

2014-04-23 Thread qingyang li
Hi, I have started a SharkServer2 and am using Java code to send queries
to this server via Hive JDBC, but I got this error:
--
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
org.apache.hive.service.cli.HiveSQLException: Error while processing
statement: FAILED: Execution Error, return code -101 from
shark.execution.SparkTask
at shark.server.SharkSQLOperation.run(SharkSQLOperation.scala:45)
at
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:180)
at
org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:152)
at
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:203)
at
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133)
at
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118)
at
org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at
org.apache.hive.service.auth.TUGIContainingProcessor$2.run(TUGIContainingProcessor.java:64)
at
org.apache.hive.service.auth.TUGIContainingProcessor$2.run(TUGIContainingProcessor.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:524)
at
org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:61)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

---
Has anyone encountered this problem?


Re: Spark on wikipedia dataset

2014-04-23 Thread Mayur Rustagi
Huge joins would be interesting. I do all my demos on the wikipedia dataset for
Shark. Joins are a typical pain to showcase & show off :)

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Wed, Apr 23, 2014 at 10:33 AM, Ajay Nair  wrote:

> I am going to perform some test experiments on the wikipedia dataset using
> the spark framework. I know the wikipedia dataset might already have been
> analyzed, but what are the potential explored/unexplored aspects of spark
> that can be tested and benchmarked on the wikipedia dataset?
>
> Thanks
> AJ
>


Sharing RDDs

2014-04-23 Thread Saumitra Shahapure (Vizury)
Hello,

Is it possible in Spark to reuse cached RDDs generated in an earlier run?

Specifically, I am trying to have a setup where a first Scala script
generates cached RDDs. If another Scala script tries to perform the same
operations on the same dataset, it should be able to get results from the
cache generated in the earlier run.

Is there any direct/indirect way to do this?

--
Regards,
Saumitra Shahapure