[jira] [Closed] (SPARK-4680) Add support for no-op compression

2014-12-02 Thread Victor Tso (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Victor Tso closed SPARK-4680.
-
Resolution: Not a Problem

spark.broadcast.compress
spark.rdd.compress
spark.shuffle.compress

These properties are sufficient to enable/disable compression.
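For reference, a minimal sketch (assuming a standard SparkConf setup; nothing here is specific to this ticket) of turning the existing compression switches off when profiling a codec:

{code}
// Sketch only: the three properties named above already allow an
// apples-to-apples comparison with compression disabled.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.broadcast.compress", "false")
  .set("spark.rdd.compress", "false")
  .set("spark.shuffle.compress", "false")
{code}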

> Add support for no-op compression
> -
>
> Key: SPARK-4680
> URL: https://issues.apache.org/jira/browse/SPARK-4680
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.1
>Reporter: Victor Tso
>
> Specifically for quantifying performance gains of a given compression codec, 
> no-op allows the deployed application to be profiled with and without a 
> specific codec.






[jira] [Commented] (SPARK-4644) Implement skewed join

2014-12-02 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231161#comment-14231161
 ] 

Lianhui Wang commented on SPARK-4644:
-

Shixiong Zhu, yes, I agree with you. I will take a look at your PR.

> Implement skewed join
> -
>
> Key: SPARK-4644
> URL: https://issues.apache.org/jira/browse/SPARK-4644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
> Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise a OutOfMemory error, if some book has too many users 
> which cannot be fit into memory. To solve it, we propose a skewed join 
> implementation.






[jira] [Commented] (SPARK-3638) Commons HTTP client dependency conflict in extras/kinesis-asl module

2014-12-02 Thread A.K.M. Ashrafuzzaman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231168#comment-14231168
 ] 

A.K.M. Ashrafuzzaman commented on SPARK-3638:
-

[~aniket] Yes, you are right. I did not know that to use Kinesis I have to build 
Spark from source. I thought I could run it on the pre-built packages.

> Commons HTTP client dependency conflict in extras/kinesis-asl module
> 
>
> Key: SPARK-3638
> URL: https://issues.apache.org/jira/browse/SPARK-3638
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 1.1.0
>Reporter: Aniket Bhatnagar
>  Labels: dependencies
> Fix For: 1.1.1, 1.2.0
>
>
> Followed instructions as mentioned @ 
> https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md
>  and when running the example, I get the following error:
> {code}
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.http.impl.conn.DefaultClientConnectionOperator.(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V
> at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.createConnectionOperator(PoolingClientConnectionManager.java:140)
> at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:114)
> at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:99)
> at 
> com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:29)
> at 
> com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:97)
> at 
> com.amazonaws.http.AmazonHttpClient.(AmazonHttpClient.java:181)
> at 
> com.amazonaws.AmazonWebServiceClient.(AmazonWebServiceClient.java:119)
> at 
> com.amazonaws.AmazonWebServiceClient.(AmazonWebServiceClient.java:103)
> at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.(AmazonKinesisClient.java:136)
> at 
> com.amazonaws.services.kinesis.AmazonKinesisClient.(AmazonKinesisClient.java:117)
> at 
> com.amazonaws.services.kinesis.AmazonKinesisAsyncClient.(AmazonKinesisAsyncClient.java:132)
> {code}
> I believe this is due to the dependency conflict as described @ 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201409.mbox/%3ccajob8btdxks-7-spjj5jmnw0xsnrjwdpcqqtjht1hun6j4z...@mail.gmail.com%3E






[jira] [Comment Edited] (SPARK-4644) Implement skewed join

2014-12-02 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231161#comment-14231161
 ] 

Lianhui Wang edited comment on SPARK-4644 at 12/2/14 8:48 AM:
--

[~zsxwing] Yes, I agree with you. I will take a look at your PR.


was (Author: lianhuiwang):
@Shixiong Zhu, yes, I agree with you. I will take a look at your PR.

> Implement skewed join
> -
>
> Key: SPARK-4644
> URL: https://issues.apache.org/jira/browse/SPARK-4644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
> Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise a OutOfMemory error, if some book has too many users 
> which cannot be fit into memory. To solve it, we propose a skewed join 
> implementation.






[jira] [Comment Edited] (SPARK-4644) Implement skewed join

2014-12-02 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231161#comment-14231161
 ] 

Lianhui Wang edited comment on SPARK-4644 at 12/2/14 8:48 AM:
--

@Shixiong Zhu, yes, I agree with you. I will take a look at your PR.


was (Author: lianhuiwang):
Shixiong Zhu, yes, I agree with you. I will take a look at your PR.

> Implement skewed join
> -
>
> Key: SPARK-4644
> URL: https://issues.apache.org/jira/browse/SPARK-4644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
> Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise a OutOfMemory error, if some book has too many users 
> which cannot be fit into memory. To solve it, we propose a skewed join 
> implementation.






[jira] [Created] (SPARK-4691) code optimization for judgement

2014-12-02 Thread maji2014 (JIRA)
maji2014 created SPARK-4691:
---

 Summary: code optimization for judgement
 Key: SPARK-4691
 URL: https://issues.apache.org/jira/browse/SPARK-4691
 Project: Spark
  Issue Type: Bug
Reporter: maji2014
Priority: Minor


aggregator and mapSideCombine judgement in 
HashShuffleWriter.scala 
SortShuffleWriter.scala
HashShuffleReader.scala






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-02 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231226#comment-14231226
 ] 

Meethu Mathew commented on SPARK-4156:
--

We had run the GMM code on two public datasets:
 http://cs.joensuu.fi/sipu/datasets/s1.txt 
 http://cs.joensuu.fi/sipu/datasets/birch2.txt 

It was observed in both cases that the execution converged at the 3rd iteration 
and the w, mu and sigma were identical for all the components. The code was run 
using the following commands:
./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM s1.csv 15 .0001
./bin/run-example org.apache.spark.examples.mllib.DenseGmmEM birch2.csv 100 .0001

Are we missing something here?

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> 
>
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Travis Galoppo
>Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for 
> Gaussian mixture models






[jira] [Commented] (SPARK-4691) code optimization for judgement

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231243#comment-14231243
 ] 

Apache Spark commented on SPARK-4691:
-

User 'maji2014' has created a pull request for this issue:
https://github.com/apache/spark/pull/3553

> code optimization for judgement
> ---
>
> Key: SPARK-4691
> URL: https://issues.apache.org/jira/browse/SPARK-4691
> Project: Spark
>  Issue Type: Bug
>Reporter: maji2014
>Priority: Minor
>
> aggregator and mapSideCombine judgement in 
> HashShuffleWriter.scala 
> SortShuffleWriter.scala
> HashShuffleReader.scala






[jira] [Commented] (SPARK-4685) Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231275#comment-14231275
 ] 

Apache Spark commented on SPARK-4685:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/3554

> Update JavaDoc settings to include spark.ml and all spark.mllib subpackages 
> in the right sections
> -
>
> Key: SPARK-4685
> URL: https://issues.apache.org/jira/browse/SPARK-4685
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Matei Zaharia
>Priority: Trivial
>
> Right now they're listed under "other packages" on the homepage of the 
> JavaDoc docs.






[jira] [Created] (SPARK-4692) Support ! boolean logic operator like NOT

2014-12-02 Thread YanTang Zhai (JIRA)
YanTang Zhai created SPARK-4692:
---

 Summary: Support ! boolean logic operator like NOT
 Key: SPARK-4692
 URL: https://issues.apache.org/jira/browse/SPARK-4692
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: YanTang Zhai
Priority: Minor


select * from for_test where !(col1 > col2)
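For comparison, a small sketch (sqlContext and the for_test table are illustrative, taken from the example above) of the already-supported NOT form next to the requested ! form:

{code}
// Hypothetical sketch: the requested "!" operator has the same semantics as NOT.
val withNot  = sqlContext.sql("SELECT * FROM for_test WHERE NOT (col1 > col2)")
val withBang = sqlContext.sql("SELECT * FROM for_test WHERE !(col1 > col2)") // once this issue is resolved
{code}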






[jira] [Commented] (SPARK-4692) Support ! boolean logic operator like NOT

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231308#comment-14231308
 ] 

Apache Spark commented on SPARK-4692:
-

User 'YanTangZhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3555

> Support ! boolean logic operator like NOT
> -
>
> Key: SPARK-4692
> URL: https://issues.apache.org/jira/browse/SPARK-4692
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: YanTang Zhai
>Priority: Minor
>
> select * from for_test where !(col1 > col2)






[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-02 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231375#comment-14231375
 ] 

Valeriy Avanesov edited comment on SPARK-2426 at 12/2/14 11:47 AM:
---

I'm not sure if I understand your question...

As far as I can see, w_i stands for a row of the matrix w and h_j stands for a 
column of the matrix h.  

\sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing 
either an abs or a square: \sum_i \sum_j |r_ij - w_i*h_j| or 
\sum_i \sum_j (r_ij - w_i*h_j)^2.
It looks like l2-regularized stochastic matrix decomposition with respect to the 
Frobenius (or l1) norm. But I don't understand why you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, l2 regularized stochastic matrix decomposition problem is defined as 
follows 

Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints. 

  ||.|| stands for Frobenius norm (or l1). 

By the way: is the matrix of ranks r stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 


was (Author: acopich):
I'm not sure if I understand your question...

As far as I can see, w_i stands for a row of the matrix w and h_j stands for a 
column of the matrix h.  

\sum_i \sum_j ( r_ij - w_i*h_j) -- is not a matrix norm. Probably, you either 
miss abs or square -- \sum_i \sum_j |r_ij - w_i*h_j| or \sum_i \sum_j ( r_ij - 
w_i*h_j)^2
It looks like l2 regularized stochastic matrix decomposition with respect to 
Frobenius (or l1) norm. But I don't understand why do you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, l2 regularized stochastic matrix decomposition problem is defined as 
follows 

Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints. 

||..|| stands for Frobenius norm (or l1). 

By the way: is the matrix of ranks r stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-02 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231375#comment-14231375
 ] 

Valeriy Avanesov commented on SPARK-2426:
-

I'm not sure if I understand your question...

As far as I can see, w_i stands for a row of the matrix w and h_j stands for a 
column of the matrix h.  

\sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing 
either an abs or a square: \sum_i \sum_j |r_ij - w_i*h_j| or 
\sum_i \sum_j (r_ij - w_i*h_j)^2.
It looks like l2-regularized stochastic matrix decomposition with respect to the 
Frobenius (or l1) norm. But I don't understand why you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, l2 regularized stochastic matrix decomposition problem is defined as 
follows 

Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints. 

||..|| stands for Frobenius norm (or l1). 

By the way: is the matrix of ranks r stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 
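For readability, a sketch of the objective described above in display form (the choice between the squared Frobenius norm and the l1 norm is left open, exactly as in the comment):

{code}
% Sketch of the stated problem; ||.|| is the Frobenius (or l1) norm.
\min_{W,\,H}\; \lVert R - W H \rVert \;+\; \lambda \left( \lVert W \rVert + \lVert H \rVert \right)
\quad \text{subject to non-negativity and normalization constraints on } W, H.
{code}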

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-02 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231375#comment-14231375
 ] 

Valeriy Avanesov edited comment on SPARK-2426 at 12/2/14 11:47 AM:
---

I'm not sure if I understand your question...

As far as I can see, w_i stands for a row of the matrix w and h_j stands for a 
column of the matrix h.  

\sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing 
either an abs or a square: \sum_i \sum_j |r_ij - w_i*h_j| or 
\sum_i \sum_j (r_ij - w_i*h_j)^2.
It looks like l2-regularized stochastic matrix decomposition with respect to the 
Frobenius (or l1) norm. But I don't understand why you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, l2 regularized stochastic matrix decomposition problem is defined as 
follows 

Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints. 

  \||.|| stands for Frobenius norm (or l1). 

By the way: is the matrix of ranks r stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 


was (Author: acopich):
I'm not sure if I understand your question...

As far as I can see, w_i stands for a row of the matrix w and h_j stands for a 
column of the matrix h.  

\sum_i \sum_j ( r_ij - w_i*h_j) -- is not a matrix norm. Probably, you either 
miss abs or square -- \sum_i \sum_j |r_ij - w_i*h_j| or \sum_i \sum_j ( r_ij - 
w_i*h_j)^2
It looks like l2 regularized stochastic matrix decomposition with respect to 
Frobenius (or l1) norm. But I don't understand why do you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, l2 regularized stochastic matrix decomposition problem is defined as 
follows 

Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints. 

  ||.|| stands for Frobenius norm (or l1). 

By the way: is the matrix of ranks r stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 

> Quadratic Minimization for MLlib ALS
> 
>
> Key: SPARK-2426
> URL: https://issues.apache.org/jira/browse/SPARK-2426
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Debasish Das
>Assignee: Debasish Das
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> Current ALS supports least squares and nonnegative least squares.
> I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
> the following ALS problems:
> 1. ALS with bounds
> 2. ALS with L1 regularization
> 3. ALS with Equality constraint and bounds
> Initial runtime comparisons are presented at Spark Summit. 
> http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
> Based on Xiangrui's feedback I am currently comparing the ADMM based 
> Quadratic Minimization solvers with IPM based QpSolvers and the default 
> ALS/NNLS. I will keep updating the runtime comparison results.
> For integration the detailed plan is as follows:
> 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
> 2. Integrate QuadraticMinimizer in mllib ALS






[jira] [Created] (SPARK-4693) PruningPredicates may be wrong if predicates contains an empty AttributeSet() references

2014-12-02 Thread YanTang Zhai (JIRA)
YanTang Zhai created SPARK-4693:
---

 Summary: PruningPredicates may be wrong if predicates contains an 
empty AttributeSet() references
 Key: SPARK-4693
 URL: https://issues.apache.org/jira/browse/SPARK-4693
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: YanTang Zhai
Priority: Minor


The sql "select * from spark_test::for_test where abs(20141202) is not null" 
has predicates=List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and 
partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the 
exception "java.lang.IllegalArgumentException: requirement failed: Partition 
pruning predicates only supported for partitioned tables." is thrown.
The sql "select * from spark_test::for_test_partitioned_table where 
abs(20141202) is not null and type_id=11 and platform = 3" with partitioned key 
insert_date has predicates=List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 
11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). 
PruningPredicates is List(IS NOT NULL 
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).
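A minimal sketch of the classification rule the description implies (the names predicates and partitionKeyIds mirror the text; this is illustrative, not the actual Spark source): a predicate should only be treated as a partition-pruning predicate when it actually references partition keys.

{code}
// Illustrative sketch: predicates with an empty references set (e.g. IS NOT NULL abs(20141202))
// must not be classified as pruning predicates.
val (pruningPredicates, otherPredicates) = predicates.partition { predicate =>
  predicate.references.nonEmpty && predicate.references.subsetOf(partitionKeyIds)
}
{code}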






[jira] [Commented] (SPARK-4693) PruningPredicates may be wrong if predicates contains an empty AttributeSet() references

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231410#comment-14231410
 ] 

Apache Spark commented on SPARK-4693:
-

User 'YanTangZhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3556

> PruningPredicates may be wrong if predicates contains an empty AttributeSet() 
> references
> 
>
> Key: SPARK-4693
> URL: https://issues.apache.org/jira/browse/SPARK-4693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: YanTang Zhai
>Priority: Minor
>
> The sql "select * from spark_test::for_test where abs(20141202) is not null" 
> has predicates=List(IS NOT NULL 
> HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)) and 
> partitionKeyIds=AttributeSet(). PruningPredicates is List(IS NOT NULL 
> HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)). Then the 
> exception "java.lang.IllegalArgumentException: requirement failed: Partition 
> pruning predicates only supported for partitioned tables." is thrown.
> The sql "select * from spark_test::for_test_partitioned_table where 
> abs(20141202) is not null and type_id=11 and platform = 3" with partitioned 
> key insert_date has predicates=List(IS NOT NULL 
> HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202), (type_id#12 = 
> 11), (platform#8 = 3)) and partitionKeyIds=AttributeSet(insert_date#24). 
> PruningPredicates is List(IS NOT NULL 
> HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFAbs(20141202)).






[jira] [Created] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode

2014-12-02 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-4694:
---

 Summary: Long-run user thread(such as HiveThriftServer2) causes 
the 'process leak' in yarn-client mode
 Key: SPARK-4694
 URL: https://issues.apache.org/jira/browse/SPARK-4694
 Project: Spark
  Issue Type: Bug
Reporter: SaintBacchus


Recently, when I used Yarn HA mode to test HiveThriftServer2, I found a 
problem: the driver can't exit by itself.
To reproduce it, do the following:
1. Use yarn HA mode and set am.maxAttemp = 1 for convenience.
2. Kill the active resource manager in the cluster.

The expected result is that the application simply fails, because maxAttemp was 1.

But the actual result is that all executors ended while the driver was still 
there and never closed.






[jira] [Updated] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode

2014-12-02 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-4694:

Component/s: YARN

> Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in 
> yarn-client mode
> -
>
> Key: SPARK-4694
> URL: https://issues.apache.org/jira/browse/SPARK-4694
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: SaintBacchus
>
> Recently when I use the Yarn HA mode to test the HiveThriftServer2 I found a 
> problem that the driver can't exit by itself.
> To reappear it, you can do as fellow:
> 1.use yarn HA mode and set am.maxAttemp = 1for convenience
> 2.kill the active resouce manager in cluster
> The expect result is just failed, because the maxAttemp was 1.
> But the actual result is that: all executor was ended but the driver was 
> still there and never close.






[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode

2014-12-02 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231440#comment-14231440
 ] 

SaintBacchus commented on SPARK-4694:
-

The reason is that Yarn had reported the status to the RM, and 
YarnClientSchedulerBackend detects that status and stops the SparkContext in the 
function 'asyncMonitorApplication'.
But HiveThriftServer2 runs as a long-running user thread, and the JVM will never 
exit until all non-daemon threads have ended or System.exit() is called.
That is what causes this problem.
The easiest way to resolve it is to use System.exit(0) instead of sc.stop in 
'asyncMonitorApplication'.
But System.exit is not recommended, as discussed in 
https://issues.apache.org/jira/browse/SPARK-4584
Do you have any ideas about this problem? [~vanzin]
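
A minimal, self-contained sketch of the JVM behaviour described above (the thread here is hypothetical and stands in for HiveThriftServer2's serving thread):

{code}
// Sketch: a non-daemon user thread keeps the JVM alive even after the
// monitoring code has stopped the SparkContext, which matches the symptom above.
object NonDaemonKeepsJvmAlive {
  def main(args: Array[String]): Unit = {
    val serving = new Thread(new Runnable {
      override def run(): Unit = while (true) Thread.sleep(1000) // long-running user work
    })
    serving.setDaemon(false) // like the Thrift server's serving thread
    serving.start()
    // ... sc.stop() would return here, yet the process does not exit,
    // because a non-daemon thread is still running; System.exit(0) would force it.
  }
}
{code}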

> Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in 
> yarn-client mode
> -
>
> Key: SPARK-4694
> URL: https://issues.apache.org/jira/browse/SPARK-4694
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: SaintBacchus
>
> Recently when I use the Yarn HA mode to test the HiveThriftServer2 I found a 
> problem that the driver can't exit by itself.
> To reappear it, you can do as fellow:
> 1.use yarn HA mode and set am.maxAttemp = 1for convenience
> 2.kill the active resouce manager in cluster
> The expect result is just failed, because the maxAttemp was 1.
> But the actual result is that: all executor was ended but the driver was 
> still there and never close.






[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-02 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231474#comment-14231474
 ] 

Travis Galoppo commented on SPARK-4156:
---

Ok, I looked into this.  This is the result of using unit covariance matrices 
for initialization; specifically, the numbers in the input files are quite 
large, and [more importantly, I reckon] vary by relatively large amounts, thus 
the initial unit covariance matrices are poor choices, driving the 
probabilities to ~zero.

I tested the S1 dataset after scaling the inputs by 10, and the algorithm 
yielded:

w=0.018651 mu=[1.4005351951422986,5.560161272092209] sigma=
0.0047916181666818325  1.8492627979416199E-4  
1.8492627979416199E-4  0.011135224999325288   

w=0.070139 mu=[3.9826648305512444,4.048416241679408] sigma=
0.08975122201635877   0.011161215961635662  
0.011161215961635662  0.07281211382882091   

w=0.203390 mu=[4.50966114011736,8.335671907946685] sigma=
3.343575502968182  0.16780915524083184  
0.16780915524083184  0.1983579752119624   

w=0.061357 mu=[8.243819479262187,7.299054596484072] sigma=
0.059502423358168244  -0.01288330287962225  
-0.01288330287962225  0.08306975793088611   

w=0.068116 mu=[3.2082470765623987,1.6153321811600052] sigma=
0.13661341675065408  -0.004671801905049122  
-0.004671801905049122  0.1184668732856653 

w=0.015480 mu=[6.032605151728542,5.76477595221249] sigma=
0.006257088363533114  -0.01541684245322017  
-0.01541684245322017  0.11177862390275095   

w=0.069246 mu=[8.599898790732793,5.47222558625928] sigma=
0.08334577559917022  0.0025980740480378017  
0.0025980740480378017  0.10560039597455884

w=0.066601 mu=[1.675642401646793,3.4768887461230293] sigma=
0.06718419616465754  -0.001992742042064677  
-0.001992742042064677  0.08394612669156842

w=0.050884 mu=[1.4034421425114039,5.586799889184816] sigma=
0.18839808914440148  -0.017016991559440697  
-0.017016991559440697  0.09967868623594711

w=0.067257 mu=[6.180341749904763,3.9855165348399026] sigma=
0.11162501735542207  0.0023201319648720187  
0.0023201319648720187  0.09177325542363057

w=0.070096 mu=[5.078726203553804,1.756463619639961] sigma=
0.07852242299631484  0.03291628699789406  
0.03291628699789406  0.08050080528055803  

w=0.015951 mu=[5.989248184898113,5.729903049835485] sigma=
0.06204977226748554   0.008716828781302866  
0.008716828781302866  0.003116768910125245  

w=0.128860 mu=[8.274797410035061,2.390551639925522] sigma=
0.10976751308928101  -0.186908554330941  
-0.186908554330941   0.7759289399492513  

w=0.065259 mu=[3.3783618332560876,5.622632293334024] sigma=
0.10109765051996433  0.0320694359617697   
0.0320694359617697   0.03873645329222697  

w=0.028714 mu=[6.146091367146795,5.732902319554125] sigma=
0.2389354399409953  0.023579597914199724  
0.023579597914199724  0.1377941370353355

Multiplied back by 10, the mu values show pretty good fidelity to the 
truth values in s1-cb.txt provided on the source website for the dataset; 
unfortunately, I do not see the original weight and covariance values used to 
generate the data.

Of course it would be easier to use if the scaling step was not necessary; I 
can modify the cluster initialization to use a covariance estimated from a 
sample and see how it works out.  What strategy did you use for initializing 
clusters in your implementation?
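
For anyone reproducing this in the meantime, a minimal sketch of the scaling workaround (rawPoints and the factor are illustrative; the idea is just to bring the coordinates down to roughly unit scale and rescale the fitted mu afterwards):

{code}
// Hypothetical sketch: rescale the raw points so that unit initial covariance
// matrices are a reasonable starting guess for the EM iterations.
import org.apache.spark.mllib.linalg.Vectors

val scale  = 1e-5 // illustrative; choose so coordinates become order one
val scaled = rawPoints.map(v => Vectors.dense(v.toArray.map(_ * scale)))
// fit the GMM on `scaled`, then divide the resulting mu values by `scale`.
{code}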

cc: [~MeethuMathew]

> Add expectation maximization for Gaussian mixture models to MLLib clustering
> 
>
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Travis Galoppo
>Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for 
> Gaussian mixture models






[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)

2014-12-02 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231481#comment-14231481
 ] 

Joerg Schad commented on SPARK-2710:


Hi,
is there any documentation about the Data Source API available?
Despite being new to Spark development, I would like to give it a try, if no one 
else is working on it yet.
Still, any pointers are welcome :-).

Joerg

> Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
> ---
>
> Key: SPARK-2710
> URL: https://issues.apache.org/jira/browse/SPARK-2710
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Teng Qiu
>
> Spark SQL can take Parquet files or JSON files as a table directly (without 
> being given a case class to define the schema).
> As a component named SQL, it should also be able to take a ResultSet from an 
> RDBMS easily.
> I find that there is a JdbcRDD in core: 
> core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> so I want to make a small change in this file to allow SQLContext to read 
> the MetaData from the PreparedStatement (reading metadata does not require 
> actually executing the query).
> Then, in Spark SQL, SQLContext can create a SchemaRDD from the JdbcRDD and its 
> MetaData.
> Further on, maybe we can add a feature to sql-shell, so that users can join 
> tables from different sources through the spark-thrift-server, 
> such as:
> {code}
> CREATE TABLE jdbc_tbl1 AS JDBC "connectionString" "username" "password" 
> "initQuery" "bound" ...
> CREATE TABLE parquet_files AS PARQUET "hdfs://tmp/parquet_table/"
> SELECT parquet_files.colX, jdbc_tbl1.colY
>   FROM parquet_files
>   JOIN jdbc_tbl1
> ON (parquet_files.id = jdbc_tbl1.id)
> {code}
> I think such a feature will be useful, like the Facebook Presto engine does.
> Oh, and there is a small bug in JdbcRDD, in compute()'s close() method:
> {code}
> if (null != conn && ! stmt.isClosed()) conn.close()
> {code}
> should be
> {code}
> if (null != conn && ! conn.isClosed()) conn.close()
> {code}
> just a small typo :)
> but with such a close method, conn will never actually be closed...






[jira] [Created] (SPARK-4695) Get result using executeCollect in spark sql

2014-12-02 Thread wangfei (JIRA)
wangfei created SPARK-4695:
--

 Summary:  Get result using executeCollect in spark sql 
 Key: SPARK-4695
 URL: https://issues.apache.org/jira/browse/SPARK-4695
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.3.0


We should use executeCollect to collect the result, because executeCollect is a 
custom implementation of collect in Spark SQL which is better than the RDD's collect.
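
An illustrative sketch of the difference (plan stands for some physical SparkPlan and is hypothetical here):

{code}
// Sketch: executeCollect() lets the physical plan collect rows with its own
// SQL-specific implementation instead of the generic RDD collect().
val viaRdd    = plan.execute().collect() // generic RDD path
val viaDirect = plan.executeCollect()    // SQL-specific path preferred by this issue
{code}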






[jira] [Updated] (SPARK-4695) Get result using executeCollect in spark sql

2014-12-02 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-4695:
---
Issue Type: Improvement  (was: Bug)

>  Get result using executeCollect in spark sql 
> --
>
> Key: SPARK-4695
> URL: https://issues.apache.org/jira/browse/SPARK-4695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> We should use executeCollect to collect the result, because executeCollect is 
> a custom implementation of collect in spark sql which better than rdd's 
> collect






[jira] [Commented] (SPARK-4695) Get result using executeCollect in spark sql

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231560#comment-14231560
 ] 

Apache Spark commented on SPARK-4695:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3547

>  Get result using executeCollect in spark sql 
> --
>
> Key: SPARK-4695
> URL: https://issues.apache.org/jira/browse/SPARK-4695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.3.0
>
>
> We should use executeCollect to collect the result, because executeCollect is 
> a custom implementation of collect in spark sql which better than rdd's 
> collect






[jira] [Created] (SPARK-4696) Yarn: spark.driver.extra* variables not applied consistently to yarn client mode AM

2014-12-02 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-4696:


 Summary: Yarn: spark.driver.extra* variables not applied 
consistently to yarn client mode AM
 Key: SPARK-4696
 URL: https://issues.apache.org/jira/browse/SPARK-4696
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves


Currently the configs to set the driver options (spark.driver.extra* -> java 
options, classpath, library path) are not applied consistently by yarn to the 
yarn client mode application master. Right now the classpath is applied but the 
java options and library path aren't.  We should make it consistent.

Note this is a result of discussions on the pull request for 
https://issues.apache.org/jira/browse/SPARK-4461
so this might be affected by the outcome of that.
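
For context, the three settings in question (a sketch; the values are placeholders):

{code}
// Sketch only: these are the spark.driver.extra* options whose handling in
// yarn-client mode should be made consistent.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.extraJavaOptions", "-XX:+PrintGCDetails")  // currently not applied to the AM
  .set("spark.driver.extraClassPath",   "/path/to/extra.jar")   // currently applied
  .set("spark.driver.extraLibraryPath", "/path/to/native/libs") // currently not applied
{code}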






[jira] [Created] (SPARK-4697) System properties should override environment variables

2014-12-02 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-4697:
--

 Summary: System properties should override environment variables
 Key: SPARK-4697
 URL: https://issues.apache.org/jira/browse/SPARK-4697
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: WangTaoTheTonic


I found that some arguments in the yarn module take environment variables before 
system properties, while in the core module system properties override 
environment variables.
This should be changed in org.apache.spark.deploy.yarn.ClientArguments and 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.
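
A minimal sketch of the precedence the title asks for (the property and variable names are illustrative only):

{code}
// Sketch: read a system property first and only fall back to the environment,
// so that system properties override environment variables.
val value: String = sys.props.get("spark.example.setting") // hypothetical key
  .orElse(sys.env.get("SPARK_EXAMPLE_SETTING"))            // hypothetical variable
  .getOrElse("default")
{code}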






[jira] [Commented] (SPARK-4697) System properties should override environment variables

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231589#comment-14231589
 ] 

Apache Spark commented on SPARK-4697:
-

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/3557

> System properties should override environment variables
> ---
>
> Key: SPARK-4697
> URL: https://issues.apache.org/jira/browse/SPARK-4697
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: WangTaoTheTonic
>
> I found some arguments in yarn module take environment variables before 
> system properties while the latter override the former in core module.
> This should be changed in org.apache.spark.deploy.yarn.ClientArguments and 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.






[jira] [Created] (SPARK-4698) Data-locality aware Partitioners

2014-12-02 Thread Kevin Mader (JIRA)
Kevin Mader created SPARK-4698:
--

 Summary: Data-locality aware Partitioners
 Key: SPARK-4698
 URL: https://issues.apache.org/jira/browse/SPARK-4698
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kevin Mader
Priority: Minor


The current hash and range partitioner tools do not seem to respect existing 
data locality. A 'dictionary'-driven partitioner that calculates the partitions 
based on the existing key locations instead of re-calculating them would be 
ideal.
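
A minimal sketch of what such a partitioner could look like (hypothetical, not part of Spark; the assignment map would be built from the keys' current locations):

{code}
// Illustrative sketch of a "dictionary" driven partitioner: assignments are
// looked up from a precomputed key -> partition map instead of being re-hashed.
import org.apache.spark.Partitioner

class DictionaryPartitioner(assignments: Map[Any, Int], override val numPartitions: Int)
    extends Partitioner {
  override def getPartition(key: Any): Int =
    assignments.getOrElse(key, ((key.hashCode % numPartitions) + numPartitions) % numPartitions)
}
{code}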







[jira] [Updated] (SPARK-4560) Lambda deserialization error

2014-12-02 Thread Alexis Seigneurin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-4560:
-
Affects Version/s: 1.1.1

> Lambda deserialization error
> 
>
> Key: SPARK-4560
> URL: https://issues.apache.org/jira/browse/SPARK-4560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0, 1.1.1
> Environment: Java 8.0.25
>Reporter: Alexis Seigneurin
> Attachments: IndexTweets.java, pom.xml
>
>
> I'm getting an error saying a lambda could not be deserialized. Here is the 
> code:
> {code}
> TwitterUtils.createStream(sc, twitterAuth, filters)
> .map(t -> t.getText())
> .foreachRDD(tweets -> {
> tweets.foreach(x -> System.out.println(x));
> return null;
> });
> {code}
> Here is the exception:
> {noformat}
> java.io.IOException: unexpected exception type
>   at 
> java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>   ... 27 more
> Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
>   at 
> com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
>   ... 37 more
> {noformat}
> The weird thing is, if I write the following code (the map operation is 
> inside the foreachRDD), it works without problem.
> {code}
> TwitterUtils.createStream(sc, twitterAuth, filters)
> .foreachRDD(tweets -> {
> tweets.map(t -> t.getText())
> .foreach(x -> System.out.println(x));
> return null;
> 

[jira] [Commented] (SPARK-4156) Add expectation maximization for Gaussian mixture models to MLLib clustering

2014-12-02 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231682#comment-14231682
 ] 

Travis Galoppo commented on SPARK-4156:
---

I do have a bug in the DenseGmmEM example code... the delta value is ignored, 
so all runs are using the default value of 0.01.  I will fix ASAP.


> Add expectation maximization for Gaussian mixture models to MLLib clustering
> 
>
> Key: SPARK-4156
> URL: https://issues.apache.org/jira/browse/SPARK-4156
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Travis Galoppo
>Assignee: Travis Galoppo
>
> As an additional clustering algorithm, implement expectation maximization for 
> Gaussian mixture models






[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Micael Capitão (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231697#comment-14231697
 ] 

Micael Capitão commented on SPARK-3553:
---

I'm having that same issue running Spark Streaming locally on my Windows 
machine.
I have something like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, 
fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files in "cdrsDir" and keeps quiet 
after that. When I move another file to that dir it detects it and processes 
it, but after that it processes the first two files again and then the third 
one in an endless loop.
If I add a fourth one it keeps processing the first two files in the same batch 
and then processes the 3rd and the 4th files in another batch.

If I add more files it keeps repeating but the behaviour gets weirder. It 
mixes, for example, the 3rd with the 5th in the same batch and the 4th with 
the 6th in another batch, and stops repeating the first two files.

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost  , (f:Path)=> true, true )
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but spark start repeating these phases 
> endlessly with the same files.
> Any thoughts what can be causing this?
> Regards,
> Easyb






[jira] [Comment Edited] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Micael Capitão (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231697#comment-14231697
 ] 

Micael Capitão edited comment on SPARK-3553 at 12/2/14 4:28 PM:


I'm having that same issue running Spark Streaming locally on my Windows 
machine.
I have something like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, 
fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files in "cdrsDir" and keeps quiet 
after that. When I move another file to that dir it detects it and processes 
it, but after that it processes the first two files again and then the third 
one, repeating this in an endless loop.
If I add a fourth one it keeps processing the first two files in the same batch 
and then processes the 3rd and the 4th files in another batch.

If I add more files it keeps repeating but the behaviour gets weirder. It 
mixes, for example, the 3rd with the 5th in the same batch and the 4th with 
the 6th in another batch, and stops repeating the first two files.


was (Author: capitao):
I'm having that same issue running Spark Streaming locally on my Windows 
machine.
I have somthing like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, 
fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files on "cdrsDir" and keeps quite 
after that. When I move another file to that dir it detects it and processes 
it, but after that it processes the first two files again and then the third 
one in an endless loop.
If I add a fourth one it keeps processing the first two files on the same batch 
and then processes the 3rd and the 4th files on another batch.

If I add more files it keeps repeating but the behaviour gets weirder. It 
mixtures, for example, the 3rd with the 5th on the same batch and the 4th with 
the 6th in another batch, stopping repeating the first two files.

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost  , (f:Path)=> true, true )
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but spark start repeating these phases 
> endlessly with the same files.
> Any thoughts what can be causing this?
> Regards,
> Easyb






[jira] [Comment Edited] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Micael Capitão (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231697#comment-14231697
 ] 

Micael Capitão edited comment on SPARK-3553 at 12/2/14 4:30 PM:


I'm having that same issue running Spark Streaming 1.1.0 locally on my Windows 
machine.
I have somthing like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, 
fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files in "cdrsDir" and keeps quiet 
after that. When I move another file to that dir it detects it and processes 
it, but after that it processes the first two files again and then the third 
one, repeating this in an endless loop.
If I add a fourth one it keeps processing the first two files in the same batch 
and then processes the 3rd and the 4th files in another batch.

If I add more files it keeps repeating, but the behaviour gets weirder. It 
mixes, for example, the 3rd with the 5th in the same batch and the 4th with 
the 6th in another batch, and stops repeating the first two files.
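
For reference, a minimal, self-contained sketch of the setup described above, in case 
anyone wants to reproduce it locally (directory, filter and batch interval are 
illustrative, not the actual values from my job):

{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamRepro {
  // only pick up .gz files, mirroring the fileFilter mentioned above
  def fileFilter(path: Path): Boolean = path.getName.endsWith(".gz")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("fileStream-repro")
    val ssc = new StreamingContext(conf, Seconds(10))
    val cdrsDir = "file:///C:/tmp/cdrs"  // watched directory (illustrative)
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      cdrsDir, fileFilter(_), newFilesOnly = false).map(_._2.toString)
    lines.count().print()  // shows how many records each 10s batch picked up
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}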


was (Author: capitao):
I'm having that same issue running Spark Streaming locally on my Windows 
machine.
I have something like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, 
fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files in "cdrsDir" and keeps quiet 
after that. When I move another file to that dir it detects it and processes 
it, but after that it processes the first two files again and then the third 
one, repeating this in an endless loop.
If I add a fourth one it keeps processing the first two files in the same batch 
and then processes the 3rd and the 4th files in another batch.

If I add more files it keeps repeating, but the behaviour gets weirder. It 
mixes, for example, the 3rd with the 5th in the same batch and the 4th with 
the 6th in another batch, and stops repeating the first two files.

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost  , (f:Path)=> true, true )
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but Spark starts repeating these batches 
> endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3523) GraphX graph partitioning strategy

2014-12-02 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
 ] 

Larry Xiao commented on SPARK-3523:
---

Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface needs to be 
changed, because the default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?
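
To make the limitation concrete, here is a minimal sketch of what the current interface 
exposes (assuming PartitionStrategy can be extended directly; the object name is 
illustrative): getPartition only sees the two endpoint ids, so a degree-aware strategy 
such as HybridCut cannot be expressed without changing the interface or partitioning 
the graph externally.

{code}
import org.apache.spark.graphx.{PartitionID, PartitionStrategy, VertexId}

// Illustrative custom strategy: it can only hash the endpoint ids, because the
// interface exposes no degree information about the graph.
object SourceHashCut extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    (math.abs(src) % numParts).toInt
}

// usage: graph.partitionBy(SourceHashCut)
{code}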

> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented and evaluated some partitioning algorithms on GraphX, and found 
> that the partitioning has room for improvement. We seek opinions and advice.
> h5. Motivation
> * Real-world graphs follow a power law. E.g. on Twitter, 1% of the vertices are 
> adjacent to nearly half of the edges.
> * A high-degree vertex concentrates vast resources, so the workload of the few 
> high-degree vertices should be spread across all machines.
> * For low-degree vertices, the computation on one vertex is quite small, so we 
> should exploit the locality of computation on low-degree vertices.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance 
> (measured using CV, coefficient of variation) of either vertices or edges of 
> all partitions. The balance of edges can infer computation balance, and the 
> balance of vertices can infer communication balance.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>  
>   * This is an example of a balanced partitioning achieving 20% saving on 
> communication.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * 
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with 
> PartitionStrategy.scala, we need to think of a way to support access to the graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed 
> Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread Ezequiel Bella (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231737#comment-14231737
 ] 

Ezequiel Bella commented on SPARK-3553:
---

Please see if this post works for you,
http://stackoverflow.com/questions/25894405/spark-streaming-app-streams-files-that-have-already-been-streamed

good luck.
easy

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost  , (f:Path)=> true, true )
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but Spark starts repeating these batches 
> endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3523) GraphX graph partitioning strategy

2014-12-02 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231734#comment-14231734
 ] 

Larry Xiao edited comment on SPARK-3523 at 12/2/14 4:50 PM:


Hi [~lianhuiwang], that's great!
Currently, to allow more complex partitioning, I changed the partitioner 
interface, because the default partitioners are hash based and don't need much 
information about the graph.
An option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?


was (Author: larryxiao):
Hi [~lianhuiwang], that's great!
As you can see, to allow more complex partitioning, the interface needs to be 
changed, because the default partitioners are hash based.
Another option is to partition the graph externally, as [~ankurd] suggested. 
What do you think?

> GraphX graph partitioning strategy
> --
>
> Key: SPARK-3523
> URL: https://issues.apache.org/jira/browse/SPARK-3523
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.0.2
>Reporter: Larry Xiao
>
> We implemented and evaluated some partitioning algorithms on GraphX, and found 
> that the partitioning has room for improvement. We seek opinions and advice.
> h5. Motivation
> * Real-world graphs follow a power law. E.g. on Twitter, 1% of the vertices are 
> adjacent to nearly half of the edges.
> * A high-degree vertex concentrates vast resources, so the workload of the few 
> high-degree vertices should be spread across all machines.
> * For low-degree vertices, the computation on one vertex is quite small, so we 
> should exploit the locality of computation on low-degree vertices.
> h5. Algorithm Description
> * HybridCut 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCut.png|width=360!
> * HybridCutPlus 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/HybridCutPlus.png|width=360!
> * Greedy BiCut
>   * a heuristic algorithm for bipartite
> h5. Result
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/FactorBalance.png|width=100%!
>   * The left Y axis is replication factor, right axis is the balance 
> (measured using CV, coefficient of variation) of either vertices or edges of 
> all partitions. The balance of edges can infer computation balance, and the 
> balance of vertices can infer communication balance.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Shuffle.png|width=360!
>  
>   * This is an example of a balanced partitioning achieving 20% saving on 
> communication.
> * 
> !https://raw.githubusercontent.com/larryxiao/spark/GraphX/Arkansol.Analyse/Bipartite.png|width=360!
>   * This is a simple partitioning result of BiCut.
> * in-2.0-1m is a generated power law graph with alpha equals 2.0
> h5. Code
> * 
> https://github.com/larryxiao/spark/blob/GraphX/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L173
> * Because the implementation breaks the current separation with 
> PartitionStrategy.scala, we need to think of a way to support access to the graph.
> h5. Reference
> - Bipartite-oriented Distributed Graph Partitioning for Big Learning.
> - PowerLyra : Differentiated Graph Computation and Partitioning on Skewed 
> Graphs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala

2014-12-02 Thread Jacky Li (JIRA)
Jacky Li created SPARK-4699:
---

 Summary: Make caseSensitive configurable in Analyzer.scala
 Key: SPARK-4699
 URL: https://issues.apache.org/jira/browse/SPARK-4699
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.1.0
Reporter: Jacky Li
 Fix For: 1.2.0


Currently, case sensitivity is true by default in Analyzer. It should be 
configurable by setting SQLConf in the client application
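
A rough sketch of the idea (the config key name is illustrative, not an existing 
setting, and wiring it into SQLContext is left out):

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.catalyst.analysis.{Analyzer, Catalog, FunctionRegistry}

// Read the flag from SQLConf instead of hard-coding caseSensitive = true.
def buildAnalyzer(sqlContext: SQLContext,
                  catalog: Catalog,
                  registry: FunctionRegistry): Analyzer = {
  val caseSensitive = sqlContext.getConf("spark.sql.caseSensitive", "true").toBoolean
  new Analyzer(catalog, registry, caseSensitive)
}
{code}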



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4686) Link to "allowed master URLs" is broken in configuration documentation

2014-12-02 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-4686.
---
   Resolution: Fixed
Fix Version/s: 1.1.2
   1.2.0

> Link to "allowed master URLs" is broken in configuration documentation
> --
>
> Key: SPARK-4686
> URL: https://issues.apache.org/jira/browse/SPARK-4686
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.2.0, 1.1.2
>
>
> The link points to the old scala programming guide; it should point to the 
> submitting applications page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4699) Make caseSensitive configurable in Analyzer.scala

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231797#comment-14231797
 ] 

Apache Spark commented on SPARK-4699:
-

User 'jackylk' has created a pull request for this issue:
https://github.com/apache/spark/pull/3558

> Make caseSensitive configurable in Analyzer.scala
> -
>
> Key: SPARK-4699
> URL: https://issues.apache.org/jira/browse/SPARK-4699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Jacky Li
> Fix For: 1.2.0
>
>
> Currently, case sensitivity is true by default in Analyzer. It should be 
> configurable by setting SQLConf in the client application



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

2014-12-02 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231815#comment-14231815
 ] 

Micael Capitão commented on SPARK-3553:
---

I've already seen that post. It didn't work for me...
I have a filter for .gz files and, like you, I've tried putting the files in the 
dir as .gz.tmp and then renaming them to remove the .tmp. The behaviour, in my 
case, is like the one I've described before.

I'm going to check if it behaves decently using HDFS for the stream data dir...
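
For reference, a small sketch of the usual move-into-place pattern on HDFS (paths are 
illustrative): stage the file outside the watched dir and rename it in, so the stream 
never sees a partially written file.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Stage the file outside the watched directory, then rename it in.
// A rename within the same FileSystem is atomic on HDFS.
def publish(fs: FileSystem, stagingFile: Path, watchedDir: Path): Boolean =
  fs.rename(stagingFile, new Path(watchedDir, stagingFile.getName))

val fs = FileSystem.get(new Configuration())
publish(fs, new Path("/tmp/staging/cdrs-0001.gz"), new Path("/data/cdrs"))
{code}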

> Spark Streaming app streams files that have already been streamed in an 
> endless loop
> 
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.1
> Environment: Ec2 cluster - YARN
>Reporter: Ezequiel Bella
>  Labels: S3, Streaming, YARN
>
> We have a spark streaming app deployed in a YARN ec2 cluster with 1 name node 
> and 2 data nodes. We submit the app with 11 executors with 1 core and 588 MB 
> of RAM each.
> The app streams from a directory in S3 which is constantly being written; 
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, 
> TextInputFormat](Settings.S3RequestsHost  , (f:Path)=> true, true )
> The purpose of using fileStream instead of textFileStream is to customize the 
> way that spark handles existing files when the process starts. We want to 
> process just the new files that are added after the process launched and omit 
> the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 
> or 5. We can see in the streaming UI how the stages are executed successfully 
> in the executors, one for each file that is processed. But when we try to add 
> a larger number of files, we face a strange behavior; the application starts 
> streaming files that have already been streamed. 
> For example, I add 20 files to s3. The files are processed in 3 batches. The 
> first batch processes 7 files, the second 8 and the third 5. No more files 
> are added to S3 at this point, but spark start repeating these phases 
> endlessly with the same files.
> Any thoughts what can be causing this?
> Regards,
> Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4694) Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in yarn-client mode

2014-12-02 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231864#comment-14231864
 ] 

Marcelo Vanzin commented on SPARK-4694:
---

I'm not sure I understand the bug or the context, but there must be some code 
that manages both the SparkContext and the HiveThriftServer2 thread. That code 
is responsible for stopping the context and shutting down the HiveThriftServer2 
thread; if it can't do it cleanly because of some deficiency of the API, it can 
use Thread.stop() or some other less kosher approach.

Using {{System.exit()}} is not recommended because there's no way for the 
backend to detect that without severe performance implications. Apps will 
always be reported as "succeeded" when using that approach.
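
A minimal sketch of that kind of shutdown (method and thread handling are illustrative, 
not the actual HiveThriftServer2 wiring):

{code}
import org.apache.spark.SparkContext

// The code that owns both the SparkContext and the long-running server thread
// brings them down itself instead of calling System.exit().
def shutdown(sc: SparkContext, serverThread: Thread): Unit = {
  sc.stop()                  // release YARN resources cleanly
  serverThread.interrupt()   // ask the long-running thread to stop
  serverThread.join(10000)   // give it up to 10 seconds
  if (serverThread.isAlive) {
    serverThread.stop()      // last resort, as mentioned above
  }
}
{code}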

> Long-run user thread(such as HiveThriftServer2) causes the 'process leak' in 
> yarn-client mode
> -
>
> Key: SPARK-4694
> URL: https://issues.apache.org/jira/browse/SPARK-4694
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: SaintBacchus
>
> Recently, when I used YARN HA mode to test the HiveThriftServer2, I found a 
> problem: the driver can't exit by itself.
> To reproduce it, you can do as follows:
> 1. Use YARN HA mode and set am.maxAttemp = 1 for convenience.
> 2. Kill the active resource manager in the cluster.
> The expected result is a plain failure, because maxAttemp was 1.
> But the actual result is that all executors ended but the driver was 
> still there and never closed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4672:
---
Fix Version/s: 1.2.0

> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> stably occur in the serialization phase at about 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long-time debug, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
> partitionsRDD:MapPartitionsRDD -> RDDs in  the previous iterations
> {code}
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%!
> More description: While a RDD is being serialized, its f function 
> {code:borderStyle=solid}
> e.g., f: (Iterator[A], Iterator[B]) => Iterator[V]) in ZippedPartitionsRDD2
> {code}
> will be serialized too. This action will be very dangero

[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231891#comment-14231891
 ] 

Reynold Xin commented on SPARK-4672:


cc [~ankurdave]

Can you take a look at this ASAP? Would be great to fix for 1.2.
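
For reference, a minimal sketch of the Phase 1 periodic-checkpoint attempt described in 
the report below (names like checkpointInterval and check are illustrative, and the 
report notes this alone did not prevent the StackOverflowError):

{code}
import org.apache.spark.graphx.Graph

def kCoreWithCheckpoint(graph: Graph[Int, Int], kNum: Int, checkpointInterval: Int)
                       (check: Graph[Int, Int] => Boolean): Graph[Int, Int] = {
  // requires sc.setCheckpointDir(...) to have been called beforehand
  var degreeGraph = graph.outerJoinVertices(graph.degrees) {
    (vid, vd, degree) => degree.getOrElse(0)
  }.cache()
  var iteration = 0
  var isConverged = false
  do {
    val subGraph = degreeGraph.subgraph(vpred = (vid, degree) => degree >= kNum).cache()
    degreeGraph = subGraph.outerJoinVertices(subGraph.degrees) {
      (vid, vd, degree) => degree.getOrElse(0)
    }.cache()
    iteration += 1
    if (iteration % checkpointInterval == 0) {
      degreeGraph.vertices.checkpoint()
      degreeGraph.edges.checkpoint()
      degreeGraph.vertices.count()  // materialize so the checkpoint is actually written
      degreeGraph.edges.count()
    }
    isConverged = check(degreeGraph)
  } while (!isConverged)
  degreeGraph
}
{code}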


> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> stably occur in the serialization phase at about 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long-time debug, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
> partitionsRDD:MapPartitionsRDD -> RDDs in  the previous iterations
> {code}
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%!
> More description: While a RDD is being serialized, its f function 
> {code:borderStyle=solid}
> e.g., f: (Iterator[A], Iter

[jira] [Created] (SPARK-4700) Add Http support to Spark Thrift server

2014-12-02 Thread Judy Nash (JIRA)
Judy Nash created SPARK-4700:


 Summary: Add Http support to Spark Thrift server
 Key: SPARK-4700
 URL: https://issues.apache.org/jira/browse/SPARK-4700
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.2.1
 Environment: Linux and Windows
Reporter: Judy Nash


Currently the Thrift server only supports TCP connections.

The ask is to add HTTP connections as well. Both TCP and HTTP are supported by 
Hive today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext

2014-12-02 Thread Kapil Malik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231964#comment-14231964
 ] 

Kapil Malik commented on SPARK-3641:


Hi all,

Is this expected to be fixed with the Spark 1.2.0 release?
Any workaround that I can use till then? A simple join on 2 cached tables is 
a pretty regular use case. Can I do something to avoid the NPE for that?

Regards,

Kapil

> Correctly populate SparkPlan.currentContext
> ---
>
> Key: SPARK-3641
> URL: https://issues.apache.org/jira/browse/SPARK-3641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yin Huai
>Assignee: Michael Armbrust
>Priority: Critical
>
> After creating a new SQLContext, we need to populate SparkPlan.currentContext 
> before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD 
> populate SparkPlan.currentContext. SQLContext.applySchema is missing this 
> call and we can have NPE as described in 
> http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4616) SPARK_CONF_DIR is not effective in spark-submit

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232030#comment-14232030
 ] 

Apache Spark commented on SPARK-4616:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/3559

> SPARK_CONF_DIR is not effective in spark-submit
> ---
>
> Key: SPARK-4616
> URL: https://issues.apache.org/jira/browse/SPARK-4616
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: leo.luan
>
> SPARK_CONF_DIR is not effective in spark-submit, because of this line in 
> spark-submit:
> DEFAULT_PROPERTIES_FILE="$SPARK_HOME/conf/spark-defaults.conf"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4701) Typo in sbt/sbt

2014-12-02 Thread Masayoshi TSUZUKI (JIRA)
Masayoshi TSUZUKI created SPARK-4701:


 Summary: Typo in sbt/sbt
 Key: SPARK-4701
 URL: https://issues.apache.org/jira/browse/SPARK-4701
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Masayoshi TSUZUKI
Priority: Trivial


in sbt/sbt
{noformat}
  -S-X   add -X to sbt's scalacOptions (-J is stripped)
{noformat}
but {{(-S is stripped)}} is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4701) Typo in sbt/sbt

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232037#comment-14232037
 ] 

Apache Spark commented on SPARK-4701:
-

User 'tsudukim' has created a pull request for this issue:
https://github.com/apache/spark/pull/3560

> Typo in sbt/sbt
> ---
>
> Key: SPARK-4701
> URL: https://issues.apache.org/jira/browse/SPARK-4701
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Masayoshi TSUZUKI
>Priority: Trivial
>
> in sbt/sbt
> {noformat}
>   -S-X   add -X to sbt's scalacOptions (-J is stripped)
> {noformat}
> but {{(-S is stripped)}} is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232042#comment-14232042
 ] 

Apache Spark commented on SPARK-4298:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/3561

> The spark-submit cannot read Main-Class from Manifest.
> --
>
> Key: SPARK-4298
> URL: https://issues.apache.org/jira/browse/SPARK-4298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Linux
> spark-1.1.0-bin-hadoop2.4.tgz
> java version "1.7.0_72"
> Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
> Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
>Reporter: Milan Straka
>
> Consider trivial {{test.scala}}:
> {code:title=test.scala|borderStyle=solid}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> object Main {
>   def main(args: Array[String]) {
> val sc = new SparkContext()
> sc.stop()
>   }
> }
> {code}
> When built with {{sbt}} and executed using {{spark-submit 
> target/scala-2.10/test_2.10-1.0.jar}}, I get the following error:
> {code}
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Error: Cannot load main class from JAR: 
> file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar
> Run with --help for usage help or --verbose for debug output
> {code}
> When executed using {{spark-submit --class Main 
> target/scala-2.10/test_2.10-1.0.jar}}, it works.
> The jar file has correct MANIFEST.MF:
> {code:title=MANIFEST.MF|borderStyle=solid}
> Manifest-Version: 1.0
> Implementation-Vendor: test
> Implementation-Title: test
> Implementation-Version: 1.0
> Implementation-Vendor-Id: test
> Specification-Vendor: test
> Specification-Title: test
> Specification-Version: 1.0
> Main-Class: Main
> {code}
> The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line 
> 127:
> {code}
>   val jar = new JarFile(primaryResource)
> {code}
> the primaryResource has the String value 
> {{"file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar"}}, which is a 
> URI, but JarFile accepts only a filesystem path. One way to fix this would be using
> {code}
>   val uri = new URI(primaryResource)
>   val jar = new JarFile(uri.getPath)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4536) Add sqrt and abs to Spark SQL DSL

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4536:

Assignee: Kousuke Saruta

> Add sqrt and abs to Spark SQL DSL
> -
>
> Key: SPARK-4536
> URL: https://issues.apache.org/jira/browse/SPARK-4536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.2.0
>
>
> Spark SQL has embedded sqrt and abs, but the DSL doesn't support those functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4536) Add sqrt and abs to Spark SQL DSL

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4536.
-
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0  (was: 1.3.0)

> Add sqrt and abs to Spark SQL DSL
> -
>
> Key: SPARK-4536
> URL: https://issues.apache.org/jira/browse/SPARK-4536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 1.2.0
>
>
> Spark SQL has embedded sqrt and abs, but the DSL doesn't support those functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3641) Correctly populate SparkPlan.currentContext

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3641:

Fix Version/s: 1.2.0

> Correctly populate SparkPlan.currentContext
> ---
>
> Key: SPARK-3641
> URL: https://issues.apache.org/jira/browse/SPARK-3641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yin Huai
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.2.0
>
>
> After creating a new SQLContext, we need to populate SparkPlan.currentContext 
> before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD 
> populate SparkPlan.currentContext. SQLContext.applySchema is missing this 
> call and we can have NPE as described in 
> http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4702) Querying non-existent partition produces exception in v1.2.0-rc1

2014-12-02 Thread Yana Kadiyska (JIRA)
Yana Kadiyska created SPARK-4702:


 Summary: Querying  non-existent partition produces exception in 
v1.2.0-rc1
 Key: SPARK-4702
 URL: https://issues.apache.org/jira/browse/SPARK-4702
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska


Using HiveThriftServer2, when querying a non-existent partition I get an 
exception rather than an empty result set. This seems to be a regression -- I 
had an older build of the master branch where this worked. A build off the 
1.2.0-rc1 tag produces the following:

14/12/02 20:04:12 WARN ThriftCLIService: Error executing statement:
org.apache.hive.service.cli.HiveSQLException: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:192)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
at 
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
at 
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at 
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
at com.sun.proxy.$Proxy19.executeStatementAsync(Unknown Source)
at 
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:344)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1313)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1298)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at 
org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext

2014-12-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232080#comment-14232080
 ] 

Michael Armbrust commented on SPARK-3641:
-

This has been fixed for a while and will be in Spark 1.2; nearly final RCs are 
already available. Running, on the same thread, any other Spark query that does 
not use applySchema should also work around the issue.
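
A rough sketch of that workaround for 1.1.x (the case class and helper are 
illustrative): create a SchemaRDD via createSchemaRDD on the same thread before 
calling applySchema, so SparkPlan.currentContext gets populated.

{code}
import org.apache.spark.sql.SQLContext

case class Warmup(i: Int)

// Call this on the thread that will later use applySchema.
def warmUp(sqlContext: SQLContext): Unit = {
  val warm = sqlContext.createSchemaRDD(sqlContext.sparkContext.parallelize(Seq(Warmup(1))))
  warm.count()  // materialize a trivial SchemaRDD so currentContext is set on this thread
}
{code}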

> Correctly populate SparkPlan.currentContext
> ---
>
> Key: SPARK-3641
> URL: https://issues.apache.org/jira/browse/SPARK-3641
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Yin Huai
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.2.0
>
>
> After creating a new SQLContext, we need to populate SparkPlan.currentContext 
> before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD 
> populate SparkPlan.currentContext. SQLContext.applySchema is missing this 
> call and we can have NPE as described in 
> http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4663) close() function is not surrounded by finally in ParquetTableOperations.scala

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4663.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3526
[https://github.com/apache/spark/pull/3526]

> close() function is not surrounded by finally in ParquetTableOperations.scala
> -
>
> Key: SPARK-4663
> URL: https://issues.apache.org/jira/browse/SPARK-4663
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: baishuo
>Priority: Minor
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4703) Windows path resolution is incorrect

2014-12-02 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4703:


 Summary: Windows path resolution is incorrect
 Key: SPARK-4703
 URL: https://issues.apache.org/jira/browse/SPARK-4703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Windows
Affects Versions: 1.2.0
 Environment: Windows 8.1
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker


C:/path/to/my.jar is resolved to file:/C:/path/to/my.jar, which is not a valid 
path. This causes simple applications submitted via spark-submit to fail.

This is caused by: https://github.com/apache/spark/pull/2795



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4703) Windows path resolution is incorrect

2014-12-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4703.

Resolution: Not a Problem

This is a mistake on my part. The path file:/C:/path/my.jar is actually a valid 
Windows path. I even wrote tests for that a while ago. Please disregard.

> Windows path resolution is incorrect
> 
>
> Key: SPARK-4703
> URL: https://issues.apache.org/jira/browse/SPARK-4703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Windows
>Affects Versions: 1.2.0
> Environment: Windows 8.1
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
>
> C:/path/to/my.jar is resolved to file:/C:/path/to/my.jar, which is not a 
> valid path. This causes simple applications submitted via spark-submit to fail.
> This is caused by: https://github.com/apache/spark/pull/2795



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1867) Spark Documentation Error causes java.lang.IllegalStateException: unread block data

2014-12-02 Thread Anson Abraham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232181#comment-14232181
 ] 

Anson Abraham commented on SPARK-1867:
--

Interesting. So it's possible spark-shell itself was compiled with an older 
version of the JDK... though I "downgraded" the JDK to 6 and I was still 
getting the same error.

> Spark Documentation Error causes java.lang.IllegalStateException: unread 
> block data
> ---
>
> Key: SPARK-1867
> URL: https://issues.apache.org/jira/browse/SPARK-1867
> Project: Spark
>  Issue Type: Bug
>Reporter: sam
>
> I've employed two System Administrators on a contract basis (for quite a bit 
> of money), and both contractors have independently hit the following 
> exception.  What we are doing is:
> 1. Installing Spark 0.9.1 according to the documentation on the website, 
> along with CDH4 (and another cluster with CDH5) distros of hadoop/hdfs.
> 2. Building a fat jar with a Spark app with sbt then trying to run it on the 
> cluster
> I've also included code snippets, and sbt deps at the bottom.
> When I've Googled this, there seems to be two somewhat vague responses:
> a) Mismatching spark versions on nodes/user code
> b) Need to add more jars to the SparkConf
> Now I know that (b) is not the problem having successfully run the same code 
> on other clusters while only including one jar (it's a fat jar).
> But I have no idea how to check for (a) - it appears Spark doesn't have any 
> version checks or anything - it would be nice if it checked versions and 
> threw a "mismatching version exception: you have user code using version X 
> and node Y has version Z".
> I would be very grateful for advice on this.
> The exception:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 
> 0.0:1 failed 32 times (most recent failure: Exception failure: 
> java.lang.IllegalStateException: unread block data)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to 
> java.lang.IllegalStateException: unread block data [duplicate 59]
> My code snippet:
> val conf = new SparkConf()
>.setMaster(clusterMaster)
>.setAppName(appName)
>.setSparkHome(sparkHome)
>.setJars(SparkContext.jarOfClass(this.getClass))
> println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
> My SBT dependencies:
> // relevant
> "org.apache.spark" % "spark-core_2.10" % "0.9.1",
> "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",
> // standard, probably unrelated
> "com.github.seratch" %% "awscala" % "[0.2,)",
> "org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
> "org.specs2" %% "specs2" % "1.14" % "test",
> "org.scala-lang" % "scala-reflect" % "2.10.3",
> "org.scalaz" %% "scalaz-core" % "7.0.5",
> "net.minidev" % "json-smart" % "1.2"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsu

[jira] [Created] (SPARK-4704) SparkSubmitDriverBootstrap doesn't flush output

2014-12-02 Thread Stephen Haberman (JIRA)
Stephen Haberman created SPARK-4704:
---

 Summary: SparkSubmitDriverBootstrap doesn't flush output
 Key: SPARK-4704
 URL: https://issues.apache.org/jira/browse/SPARK-4704
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
 Environment: 1.2.0-rc1
Reporter: Stephen Haberman


When running spark-submit with a job that immediately blows up (say due to init 
errors in the job code), there is no error output from spark-submit on the 
console.

When I ran spark-class directly, I did see the error/stack trace on the 
console.

I believe the issue is in SparkSubmitDriverBootstrapper (I had 
spark.driver.memory set in spark-defaults.conf) not waiting for the  
RedirectThreads to flush/complete before exiting.

E.g. here:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala#L143

I believe around line 165 or so, stdoutThread.join() and
stderrThread.join() calls are necessary to make sure the other threads
have had a chance to flush process.getInputStream/getErrorStream to
System.out/err before the process exits.

I've been tripped up by this in similar RedirectThread/process code, hence 
suspecting this is the problem.
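
A generic sketch of that ordering (illustrative helper class, not the actual Spark 
code): join the threads that copy the child's stdout/stderr after waitFor() and before 
exiting, so buffered output is not lost.

{code}
import java.io.{InputStream, OutputStream}

// Illustrative stand-in for Spark's redirect threads: copy a child stream to ours.
class CopyThread(in: InputStream, out: OutputStream) extends Thread {
  setDaemon(true)
  override def run(): Unit = {
    val buf = new Array[Byte](8192)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach { n =>
      out.write(buf, 0, n)
      out.flush()
    }
  }
}

val process = new ProcessBuilder("spark-class", "org.apache.spark.deploy.SparkSubmit").start()
val stdoutThread = new CopyThread(process.getInputStream, System.out)
val stderrThread = new CopyThread(process.getErrorStream, System.err)
stdoutThread.start()
stderrThread.start()

val exitCode = process.waitFor()
stdoutThread.join()  // the missing step: wait for the copiers to drain before exiting
stderrThread.join()
System.exit(exitCode)
{code}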




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4672:
---
Comment: was deleted

(was: User 'JerryLead' has created a pull request for this issue:
https://github.com/apache/spark/pull/3537)

> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> stably occur in the serialization phase at about 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long-time debug, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
> partitionsRDD:MapPartitionsRDD -> RDDs in  the previous iterations
> {code}
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%!
> More description: While a RDD is being serialized, its f function 
> {code:borderStyle=solid}
> e.g., f: (Iterator[A], Iterator[B

[jira] [Issue Comment Deleted] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4672:
---
Comment: was deleted

(was: User 'JerryLead' has created a pull request for this issue:
https://github.com/apache/spark/pull/3544)

> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> stably occur in the serialization phase at about 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
>   
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= KNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long-time debug, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
> partitionsRDD:MapPartitionsRDD -> RDDs in  the previous iterations
> {code}
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%!
> More description: While a RDD is being serialized, its f function 
> {code:borderStyle=solid}
> e.g., f: (Iterator[A], Iterator[B

[jira] [Commented] (SPARK-4688) Have a single shared network timeout in Spark

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232219#comment-14232219
 ] 

Apache Spark commented on SPARK-4688:
-

User 'varunsaxena' has created a pull request for this issue:
https://github.com/apache/spark/pull/3562

> Have a single shared network timeout in Spark
> -
>
> Key: SPARK-4688
> URL: https://issues.apache.org/jira/browse/SPARK-4688
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Critical
>
> We have several different timeouts, but in most cases users just want to set 
> something that is large enough that they can avoid GC pauses. We should have 
> a single conf "spark.network.timeout" that is used as the default timeout for 
> all network interactions. This can replace the following timeouts:
> {code}
> spark.core.connection.ack.wait.timeout
> spark.akka.timeout  
> spark.storage.blockManagerSlaveTimeoutMs  (undocumented)
> spark.shuffle.io.connectionTimeout (undocumented)
> {code}
> Of course, for compatibility we should respect the old ones when they are 
> set. This idea was proposed originally by [~rxin] and I'm paraphrasing his 
> suggestion here.
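
For illustration, a minimal sketch of how a component could honour such a shared default while still respecting its legacy key (the helper name and the 120-second fallback are assumptions, not the actual patch):

{code}
import org.apache.spark.SparkConf

// Prefer the old, specific setting if the user set it, then the shared
// spark.network.timeout, and finally a hard-coded fallback.
def ackWaitTimeoutSeconds(conf: SparkConf): Int = {
  conf.getOption("spark.core.connection.ack.wait.timeout")
    .orElse(conf.getOption("spark.network.timeout"))
    .map(_.toInt)
    .getOrElse(120)
}
{code}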



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1127) Add saveAsHBase to PairRDDFunctions

2014-12-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1423#comment-1423
 ] 

Ted Yu commented on SPARK-1127:
---

According to Reynold,
First half of the external data source API (for reading but not writing) is 
already in 1.2:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

> Add saveAsHBase to PairRDDFunctions
> ---
>
> Key: SPARK-1127
> URL: https://issues.apache.org/jira/browse/SPARK-1127
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: haosdent huang
>Assignee: haosdent huang
> Fix For: 1.2.0
>
>
> Support to save data in HBase.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4676) JavaSchemaRDD.schema may throw NullType MatchError if sql has null

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4676.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3538
[https://github.com/apache/spark/pull/3538]

> JavaSchemaRDD.schema may throw NullType MatchError if sql has null
> --
>
> Key: SPARK-4676
> URL: https://issues.apache.org/jira/browse/SPARK-4676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: YanTang Zhai
> Fix For: 1.2.0
>
>
> val jsc = new org.apache.spark.api.java.JavaSparkContext(sc)
> val jhc = new org.apache.spark.sql.hive.api.java.JavaHiveContext(jsc)
> val nrdd = jhc.hql("select null from spark_test.for_test")
> println(nrdd.schema)
> Then the error is thrown as follows:
> scala.MatchError: NullType (of class 
> org.apache.spark.sql.catalyst.types.NullType$)
> at 
> org.apache.spark.sql.types.util.DataTypeConversions$.asJavaDataType(DataTypeConversions.scala:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4593) sum(1/0) would produce a very large number

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4593.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3443
[https://github.com/apache/spark/pull/3443]

> sum(1/0) would produce a very large number
> --
>
> Key: SPARK-4593
> URL: https://issues.apache.org/jira/browse/SPARK-4593
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
>Priority: Minor
> Fix For: 1.2.0
>
>
> SELECT max(1/0) FROM src would get a very large number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4670) bitwise NOT has a wrong `toString` output

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4670.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 3528
[https://github.com/apache/spark/pull/3528]

> bitwise NOT has a wrong `toString` output
> -
>
> Key: SPARK-4670
> URL: https://issues.apache.org/jira/browse/SPARK-4670
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Adrian Wang
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4695) Get result using executeCollect in spark sql

2014-12-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4695.
-
   Resolution: Fixed
Fix Version/s: (was: 1.3.0)
   1.2.0

Issue resolved by pull request 3547
[https://github.com/apache/spark/pull/3547]

>  Get result using executeCollect in spark sql 
> --
>
> Key: SPARK-4695
> URL: https://issues.apache.org/jira/browse/SPARK-4695
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> We should use executeCollect to collect the result, because executeCollect is 
> a custom implementation of collect in Spark SQL that is better than RDD's 
> collect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled

2014-12-02 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-4705:
-

 Summary: Driver retries in yarn-cluster mode always fail if event 
logging is enabled
 Key: SPARK-4705
 URL: https://issues.apache.org/jira/browse/SPARK-4705
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.2.0
Reporter: Marcelo Vanzin


yarn-cluster mode will retry running the driver in certain failure modes. If 
event logging is enabled, the retry will most probably fail, because:

{noformat}
Exception in thread "Driver" java.io.IOException: Log directory 
hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003
 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at 
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.(SparkContext.scala:353)
{noformat}

The event log path should be "more unique". Or perhaps retries of the same app 
should clean up the old logs first.
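
For illustration, a minimal sketch of the "clean up the old logs first" option using the Hadoop FileSystem API (the helper name is made up and this is not the actual fix):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def removeStaleEventLogDir(logDirUri: String): Unit = {
  val path = new Path(logDirUri)
  val fs = FileSystem.get(path.toUri, new Configuration())
  if (fs.exists(path)) {
    // A failed previous attempt of the same application left this directory
    // behind; delete it recursively so the retried driver can start logging.
    fs.delete(path, true)
  }
}
{code}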



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4706) Remove FakeParquetSerDe

2014-12-02 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-4706:
---

 Summary: Remove FakeParquetSerDe
 Key: SPARK-4706
 URL: https://issues.apache.org/jira/browse/SPARK-4706
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3431) Parallelize execution of tests

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232309#comment-14232309
 ] 

Apache Spark commented on SPARK-3431:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3564

> Parallelize execution of tests
> --
>
> Key: SPARK-3431
> URL: https://issues.apache.org/jira/browse/SPARK-3431
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Nicholas Chammas
>
> Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common 
> strategy to cut test time down is to parallelize the execution of the tests. 
> Doing that may in turn require some prerequisite changes to be made to how 
> certain tests run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4707) Reliable Kafka Receiver can lose data if the block generator fails to store data

2014-12-02 Thread Hari Shreedharan (JIRA)
Hari Shreedharan created SPARK-4707:
---

 Summary: Reliable Kafka Receiver can lose data if the block 
generator fails to store data
 Key: SPARK-4707
 URL: https://issues.apache.org/jira/browse/SPARK-4707
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Hari Shreedharan


The Reliable Kafka Receiver commits offsets only when events are actually 
stored, which ensures that on restart we will actually start where we left off. 
But if the failure happens in the store() call, and the block generator reports 
an error the receiver does not do anything and will continue reading from the 
current offset and not the last commit. This means that messages between the 
last commit and the current offset will be lost. 

I will send a PR for this soon - I have a patch which needs some minor fixes, 
which I need to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4525) MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers

2014-12-02 Thread Jongyoul Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jongyoul Lee updated SPARK-4525:

Shepherd:   (was: Andrew Or)

> MesosSchedulerBackend.resourceOffers cannot decline unused offers from 
> acceptedOffers
> -
>
> Key: SPARK-4525
> URL: https://issues.apache.org/jira/browse/SPARK-4525
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Blocker
> Fix For: 1.2.0
>
>
> After the resourceOffers function was refactored in SPARK-2269, it no longer 
> declines the unused offers among the accepted offers. That's because 
> driver.launchTasks only calls declineOffer(offer.id) when it is invoked with 
> an empty collection of tasks for an offer:
> {quote}
> Invoking this function with an empty collection of tasks declines offers in 
> their entirety (see SchedulerDriver.declineOffer(OfferID, Filters)).
> - 
> http://mesos.apache.org/api/latest/java/org/apache/mesos/MesosSchedulerDriver.html#launchTasks(OfferID,%20java.util.Collection,%20Filters)
> {quote}
> In branch-1.1, resourceOffers called launchTasks for every offered offer, so 
> the driver declined unused resources. In the current master, however, offers 
> are first split into accepted and declined offers based on their resources: 
> the declined offers are declined explicitly, and accepted offers that have 
> tasks are launched via driver.launchTasks, but accepted offers that end up 
> with no tasks are neither launched with an empty task list nor declined 
> explicitly. Thus, the Mesos master judges that those offers are in use by the 
> TaskScheduler and that no resources remain.
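
For illustration, a minimal sketch of the missing step described above (variable names are assumptions): any accepted offer that ends up with no tasks should still be declined explicitly so the Mesos master can reoffer its resources.

{code}
import org.apache.mesos.SchedulerDriver
import org.apache.mesos.Protos.Offer

// Explicitly hand back accepted offers that received no tasks.
def declineUnusedOffers(driver: SchedulerDriver, unusedOffers: Seq[Offer]): Unit = {
  unusedOffers.foreach(offer => driver.declineOffer(offer.getId))
}
{code}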



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4575) Documentation for the pipeline features

2014-12-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4575:
-
Assignee: Joseph K. Bradley  (was: Xiangrui Meng)

> Documentation for the pipeline features
> ---
>
> Key: SPARK-4575
> URL: https://issues.apache.org/jira/browse/SPARK-4575
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.2.0
>Reporter: Xiangrui Meng
>Assignee: Joseph K. Bradley
>
> Add user guide for the newly added ML pipeline feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4707) Reliable Kafka Receiver can lose data if the block generator fails to store data

2014-12-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232419#comment-14232419
 ] 

Saisai Shao commented on SPARK-4707:


Thanks Hari for this fix. A basic question: how should we treat the data when 
store() fails? Shall we stop receiving data from Kafka?

> Reliable Kafka Receiver can lose data if the block generator fails to store 
> data
> 
>
> Key: SPARK-4707
> URL: https://issues.apache.org/jira/browse/SPARK-4707
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Hari Shreedharan
>
> The Reliable Kafka Receiver commits offsets only when events are actually 
> stored, which ensures that on restart we will actually start where we left 
> off. But if the failure happens in the store() call, and the block generator 
> reports an error the receiver does not do anything and will continue reading 
> from the current offset and not the last commit. This means that messages 
> between the last commit and the current offset will be lost. 
> I will send a PR for this soon - I have a patch which needs some minor fixes, 
> which I need to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4708) k-mean runs two/three times faster with dense/sparse sample

2014-12-02 Thread DB Tsai (JIRA)
DB Tsai created SPARK-4708:
--

 Summary: k-mean runs two/three times faster with dense/sparse 
sample
 Key: SPARK-4708
 URL: https://issues.apache.org/jira/browse/SPARK-4708
 Project: Spark
  Issue Type: Improvement
Reporter: DB Tsai


Note that the usage of `breezeSquaredDistance` in 
`org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical 
path, and breezeSquaredDistance is slow. We should replace it with our own 
implementation.

Here is the benchmark against mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs
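
For context, a hedged sketch of the idea behind a faster squared distance (plain Scala over Array[Double], not the actual MLlib patch): when the squared norms of the two vectors are precomputed, the distance reduces to a single dot product instead of allocating an intermediate difference vector.

{code}
// ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
def fastSquaredDistance(a: Array[Double], normA2: Double,
                        b: Array[Double], normB2: Double): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  var dot = 0.0
  var i = 0
  while (i < a.length) {
    dot += a(i) * b(i)
    i += 1
  }
  // Clamp at zero to guard against small negative values from rounding.
  math.max(normA2 + normB2 - 2.0 * dot, 0.0)
}
{code}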



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample

2014-12-02 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-4708:
---
Component/s: MLlib

> Make k-mean runs two/three times faster with dense/sparse sample
> 
>
> Key: SPARK-4708
> URL: https://issues.apache.org/jira/browse/SPARK-4708
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> Note that the usage of `breezeSquaredDistance` in 
> `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical 
> path, and breezeSquaredDistance is slow. We should replace it with our own 
> implementation.
> Here is the benchmark against mnist8m dataset.
> Before
> DenseVector: 70.04secs
> SparseVector: 59.05secs
> With this PR
> DenseVector: 30.58secs
> SparseVector: 21.14secs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample

2014-12-02 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-4708:
---
Summary: Make k-mean runs two/three times faster with dense/sparse sample  
(was: k-mean runs two/three times faster with dense/sparse sample)

> Make k-mean runs two/three times faster with dense/sparse sample
> 
>
> Key: SPARK-4708
> URL: https://issues.apache.org/jira/browse/SPARK-4708
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> Note that the usage of `breezeSquaredDistance` in 
> `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical 
> path, and breezeSquaredDistance is slow. We should replace it with our own 
> implementation.
> Here is the benchmark against mnist8m dataset.
> Before
> DenseVector: 70.04secs
> SparseVector: 59.05secs
> With this PR
> DenseVector: 30.58secs
> SparseVector: 21.14secs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4707) Reliable Kafka Receiver can lose data if the block generator fails to store data

2014-12-02 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232425#comment-14232425
 ] 

Hari Shreedharan commented on SPARK-4707:
-

No, not really. There is really only one case where we'd lose data - that is 
when the store fails and the receiver is still active. There are two ways of 
working around this:
* Kill the consumer without committing the offsets and start a new consumer 
which will start reading data from the last commit (this is the easiest option, 
but creating new consumers is somewhat expensive and it also causes duplicates 
due to rebalancing).
* In the second option, store all of the pending messages in an ordered buffer 
locally in the receiver and try to push the data again on failure (on success, 
just clear the buffer and commit). Finally, once the data is pushed, commit the 
offset and start reading from Kafka again (commit offsets only when there are 
no pending messages). To make this smarter, we can keep track of how many 
messages are in each block for each topic and partition and commit
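
For illustration, a rough sketch of the second option with deliberately generic, hypothetical push/commit callbacks rather than the real receiver API: buffer pending messages locally, retry the push on failure, and commit offsets only once the buffer has been stored successfully.

{code}
import scala.collection.mutable.ArrayBuffer
import scala.util.{Failure, Success, Try}

class BufferedPusher(push: Seq[String] => Unit, commitOffsets: () => Unit) {
  private val pending = new ArrayBuffer[String]()

  def add(message: String): Unit = pending += message

  /** Try to store everything buffered so far; commit only on success. */
  def flush(maxRetries: Int = 3): Boolean = {
    var attempt = 0
    while (attempt < maxRetries) {
      Try(push(pending.toSeq)) match {
        case Success(_) =>
          commitOffsets()   // safe: nothing un-stored remains in the buffer
          pending.clear()
          return true
        case Failure(_) =>
          attempt += 1      // keep the buffer and retry the push
      }
    }
    false                   // caller decides what to do after repeated failures
  }
}
{code}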



> Reliable Kafka Receiver can lose data if the block generator fails to store 
> data
> 
>
> Key: SPARK-4707
> URL: https://issues.apache.org/jira/browse/SPARK-4707
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Hari Shreedharan
>
> The Reliable Kafka Receiver commits offsets only when events are actually 
> stored, which ensures that on restart we will actually start where we left 
> off. But if the failure happens in the store() call, and the block generator 
> reports an error the receiver does not do anything and will continue reading 
> from the current offset and not the last commit. This means that messages 
> between the last commit and the current offset will be lost. 
> I will send a PR for this soon - I have a patch which needs some minor fixes, 
> which I need to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4708) Make k-mean runs two/three times faster with dense/sparse sample

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232428#comment-14232428
 ] 

Apache Spark commented on SPARK-4708:
-

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3565

> Make k-mean runs two/three times faster with dense/sparse sample
> 
>
> Key: SPARK-4708
> URL: https://issues.apache.org/jira/browse/SPARK-4708
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: DB Tsai
>
> Note that the usage of `breezeSquaredDistance` in 
> `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical 
> path, and breezeSquaredDistance is slow. We should replace it with our own 
> implementation.
> Here is the benchmark against mnist8m dataset.
> Before
> DenseVector: 70.04secs
> SparseVector: 59.05secs
> With this PR
> DenseVector: 30.58secs
> SparseVector: 21.14secs



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2710) Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)

2014-12-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232431#comment-14232431
 ] 

Michael Armbrust commented on SPARK-2710:
-

I'd suggest looking at the test cases (which have examples of each of the 
different interfaces): 
https://github.com/apache/spark/tree/master/sql/core/src/test/scala/org/apache/spark/sql/sources

Additionally, I have put together a sample library here that reads Avro data: 
https://github.com/databricks/spark-avro
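
For readers who want a feel for the reading half of that API, here is a rough, from-memory sketch of a minimal read-only data source; the class names are made up and the exact signatures should be checked against the interfaces.scala file linked above.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Looked up via the USING clause, e.g. CREATE TEMPORARY TABLE t USING com.example.dummy
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A relation that exposes a fixed schema and produces ten integer rows.
class DummyRelation(val sqlContext: SQLContext) extends TableScan {
  override def schema: StructType =
    StructType(StructField("value", IntegerType, nullable = false) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 10).map(Row(_))
}
{code}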

> Build SchemaRDD from a JdbcRDD with MetaData (no hard-coded case class)
> ---
>
> Key: SPARK-2710
> URL: https://issues.apache.org/jira/browse/SPARK-2710
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Reporter: Teng Qiu
>
> Spark SQL can take Parquet files or JSON files as a table directly (without 
> given a case class to define the schema)
> as a component named SQL, it should also be able to take a ResultSet from 
> RDBMS easily.
> i find that there is a JdbcRDD in core: 
> core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
> so i want to make some small change in this file to allow SQLContext to read 
> the MetaData from the PreparedStatement (read metadata do not need to execute 
> the query really).
> Then, in Spark SQL, SQLContext can create SchemaRDD with JdbcRDD and his 
> MetaData.
> In the further, maybe we can add a feature in sql-shell, so that user can 
> using spark-thrift-server join tables from different sources
> such as:
> {code}
> CREATE TABLE jdbc_tbl1 AS JDBC "connectionString" "username" "password" 
> "initQuery" "bound" ...
> CREATE TABLE parquet_files AS PARQUET "hdfs://tmp/parquet_table/"
> SELECT parquet_files.colX, jdbc_tbl1.colY
>   FROM parquet_files
>   JOIN jdbc_tbl1
> ON (parquet_files.id = jdbc_tbl1.id)
> {code}
> I think such a feature will be useful, like facebook Presto engine does.
> oh, and there is a small bug in JdbcRDD
> in compute(), method close()
> {code}
> if (null != conn && ! stmt.isClosed()) conn.close()
> {code}
> should be
> {code}
> if (null != conn && ! conn.isClosed()) conn.close()
> {code}
> just a small write error :)
> but such a close method will never be able to close conn...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4707) Reliable Kafka Receiver can lose data if the block generator fails to store data

2014-12-02 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232451#comment-14232451
 ] 

Saisai Shao commented on SPARK-4707:


The second choice seems cool; looking forward to your patch :).

> Reliable Kafka Receiver can lose data if the block generator fails to store 
> data
> 
>
> Key: SPARK-4707
> URL: https://issues.apache.org/jira/browse/SPARK-4707
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Hari Shreedharan
>
> The Reliable Kafka Receiver commits offsets only when events are actually 
> stored, which ensures that on restart we will actually start where we left 
> off. But if the failure happens in the store() call, and the block generator 
> reports an error the receiver does not do anything and will continue reading 
> from the current offset and not the last commit. This means that messages 
> between the last commit and the current offset will be lost. 
> I will send a PR for this soon - I have a patch which needs some minor fixes, 
> which I need to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4709) Spark SQL support for Parquet with timestamp type field

2014-12-02 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-4709:
---

 Summary: Spark SQL support for Parquet with timestamp type field
 Key: SPARK-4709
 URL: https://issues.apache.org/jira/browse/SPARK-4709
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Felix Cheung
Priority: Critical


Have a data set on Parquet format (created by Hive) with a field of the 
timestamp type. Reading this causes an exception:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val p = sqlContext.parquetFile("hdfs:///data/parquetdata")

java.lang.RuntimeException: Potential loss of precision: cannot convert INT96
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
at 
org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:17)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
at $iwC$$iwC$$iwC$$iwC.(:24)
at $iwC$$iwC$$iwC.(:26)
at $iwC$$iwC.(:28)
at $iwC.(:30)
at (:32)
at .(:36)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
at 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
at 
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
at 
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




--
This message was sent b

[jira] [Updated] (SPARK-4709) Spark SQL support error reading Parquet with timestamp type field

2014-12-02 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-4709:

Summary: Spark SQL support error reading Parquet with timestamp type field  
(was: Spark SQL support for Parquet with timestamp type field)

> Spark SQL support error reading Parquet with timestamp type field
> -
>
> Key: SPARK-4709
> URL: https://issues.apache.org/jira/browse/SPARK-4709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Felix Cheung
>Priority: Critical
>
> Have a data set on Parquet format (created by Hive) with a field of the 
> timestamp type. Reading this causes an exception:
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val p = sqlContext.parquetFile("hdfs:///data/parquetdata")
> java.lang.RuntimeException: Potential loss of precision: cannot convert INT96
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation.(ParquetRelation.scala:66)
>   at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:17)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:22)
>   at $iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC.(:28)
>   at $iwC.(:30)
>   at (:32)
>   at .(:36)
>   at .()
>   at .(:7)
>   at .()
>   at $print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
>   at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
>   at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
>   at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>   at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
>   at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>   at org.apache.spark.repl.Main.main(Main.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect

[jira] [Updated] (SPARK-4707) Reliable Kafka Receiver can lose data if the block generator fails to store data

2014-12-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4707:
---
Priority: Critical  (was: Major)

> Reliable Kafka Receiver can lose data if the block generator fails to store 
> data
> 
>
> Key: SPARK-4707
> URL: https://issues.apache.org/jira/browse/SPARK-4707
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Hari Shreedharan
>Priority: Critical
>
> The Reliable Kafka Receiver commits offsets only when events are actually 
> stored, which ensures that on restart we will actually start where we left 
> off. But if the failure happens in the store() call, and the block generator 
> reports an error the receiver does not do anything and will continue reading 
> from the current offset and not the last commit. This means that messages 
> between the last commit and the current offset will be lost. 
> I will send a PR for this soon - I have a patch which needs some minor fixes, 
> which I need to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-874) Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are finished

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232504#comment-14232504
 ] 

Apache Spark commented on SPARK-874:


User 'jbencook' has created a pull request for this issue:
https://github.com/apache/spark/pull/3567

> Have a --wait flag in ./sbin/stop-all.sh that polls until Worker's are 
> finished
> ---
>
> Key: SPARK-874
> URL: https://issues.apache.org/jira/browse/SPARK-874
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Reporter: Patrick Wendell
>Assignee: Archit Thakur
>Priority: Minor
>  Labels: starter
> Fix For: 1.2.0
>
>
> When running benchmarking jobs, sometimes the cluster takes a long time to 
> shut down. We should add a feature where it will ssh into all the workers 
> every few seconds and check that the processes are dead, and won't return 
> until they are all dead. This would help a lot with automating benchmarking 
> scripts.
> There is some equivalent logic here written in python, we just need to add it 
> to the shell script:
> https://github.com/pwendell/spark-perf/blob/master/bin/run#L117



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4644) Implement skewed join

2014-12-02 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232522#comment-14232522
 ] 

Shixiong Zhu commented on SPARK-4644:
-

Looks like `groupByKey` is really different from `join`. The signature of 
`groupByKey` is `def groupByKey(partitioner: Partitioner): RDD[(K, 
Iterable[V])]`, so the return value is `RDD[(K, Iterable[V])]`. It exposes the 
internal data structure as an `Iterable` to the user, and the user can write 
`rdd.groupByKey().repartition(5)`. Therefore, the `Iterable` returned by 
`groupByKey` needs to be `Serializable` and usable on other nodes.

The `ChunkBuffer` I designed for the skewed join is only used internally and 
won't be exposed to the user. So for now it's not `Serializable` and cannot be 
used by `groupByKey`.

In summary, we need a special `Iterable` for `groupByKey`: it can write to disk 
if there is insufficient space, and it can be used on any node, which means 
this `Iterable` can access other nodes' disks (maybe via the BlockManager?). 
Therefore, for now I cannot find a general approach that works for both `join` 
and `groupByKey`.

> Implement skewed join
> -
>
> Key: SPARK-4644
> URL: https://issues.apache.org/jira/browse/SPARK-4644
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
> Attachments: Skewed Join Design Doc.pdf
>
>
> Skewed data is not rare. For example, a book recommendation site may have 
> several books which are liked by most of the users. Running ALS on such 
> skewed data will raise an OutOfMemory error if some book has too many users 
> to fit into memory. To solve it, we propose a skewed join 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232524#comment-14232524
 ] 

Jason Dai commented on SPARK-4672:
--

We ran into the same issue, and this is a nice summary for the bug analysis. On 
the other hand, while this may fix the specific GraphX issue, I don't think it 
is generally applicable for dealing with super long lineage that can be 
generated in GraphX or other iterative algorithms. 

In particular, the user can define arbitrary functions, which can be called in 
RDD.compute() and refer to an arbitrary member variable that is an RDD, or can 
be used to construct another RDD, such as:
{noformat}
  class MyRDD (val rdd1, val rdd2, func1) extends RDD {
val func2 = (f, iter1, iter2) => iter1– f(iter2)
…
override def compute(part, sc) {
  func2(func1, rdd1.iterator(part, sc), rdd2.iterator(part, sc))
}
…
   define newRDD(val rdd3, func3) = {
 val func4 = func2(func3)
 new AnotherRDD() {
   override def compute(part, sc) {
 func4(rdd1.iterator(part, sc) + rdd2.iterator(part, sc), 
rdd3.iterator(part, sc))
   }  
}
  }
}
{noformat}
In this case, we will need to serialize rdd1 and rdd2 before MyRDD is 
checkpointed; after MyRDD is checkpointed, we don’t need to serialize rdd1 or 
rdd2, but we cannot clear func2 either. 

I think we can fix this more general issue as follows:
# As only RDD.compute (or RDD.iterator) should be called at the worker side, we 
only need to serialize whatever is referenced in that function (no matter 
whether it's a member variable or not)
# After the RDD is checkpointed, RDD.compute should be changed to read the 
checkpoint file, which will not reference other variables; again, we only need 
to serialize whatever is referenced in that function now


> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> stably occur in the serialization phase at about 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
>   
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= KNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.

[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232527#comment-14232527
 ] 

Reynold Xin commented on SPARK-4672:


Yea it makes sense to remove all the function closure f from an RDD if it is 
checkpointed.


> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> reliably occur in the serialization phase at about the 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
>   
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long debugging session, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
> partitionsRDD:MapPartitionsRDD -> RDDs in  the previous iterations
> {code}
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g2.png|width=100%!
> More description: While an RDD is being serialized, its f function 
> {code:borderStyle=solid}
> e.g., f: (Iterato

[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232529#comment-14232529
 ] 

Jason Dai commented on SPARK-4672:
--

[~rxin] what exactly do you mean by "remove all the function closure f from an 
RDD if it is checkpointed"?

In my previous example, we should not clear func2 even if MyRDD is 
checkpointed, otherwise newRDD() will no longer be correct. Instead, we should 
make sure we only include RDD.compute (or RDD.iterator) in the closure (no 
matter whether the RDD is checkpointed or not), and change RDD.compute to read 
checkpoint files once it is checkpointed.
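
For readers without the earlier example at hand, here is a hypothetical sketch of 
the kind of situation being described (MyRDD, func2 and newRDD are placeholder 
names, not the actual code from the earlier comment): a closure can be needed by 
methods other than compute(), so it cannot simply be nulled out at checkpoint time.

{code:borderStyle=solid}
// Hypothetical illustration only; MyRDD, func2 and newRDD are placeholders.
class MyRDD(val data: Seq[Int], val func2: Int => Int) extends Serializable {

  // Only this runs on the worker side, so only what it references really
  // needs to be shipped with a task.
  def compute(): Iterator[Int] = data.iterator

  // func2 is also used on the driver side; blindly clearing it after a
  // checkpoint would silently break this method.
  def newRDD(): MyRDD = new MyRDD(data.map(func2), func2)
}
{code}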

> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> reliably occur in the serialization phase at about the 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
>   
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long debugging session, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rd

[jira] [Created] (SPARK-4710) Fix MLlib compilation warnings

2014-12-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4710:


 Summary: Fix MLlib compilation warnings
 Key: SPARK-4710
 URL: https://issues.apache.org/jira/browse/SPARK-4710
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Trivial


MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4710) Fix MLlib compilation warnings

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232534#comment-14232534
 ] 

Apache Spark commented on SPARK-4710:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/3568

> Fix MLlib compilation warnings
> --
>
> Key: SPARK-4710
> URL: https://issues.apache.org/jira/browse/SPARK-4710
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> MLlib has 2 compilation warnings from DecisionTreeRunner and StreamingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232533#comment-14232533
 ] 

Reynold Xin commented on SPARK-4672:


Ok I admit I wasn't reading your comment too carefully :)

Is there a concrete way you are proposing that would solve this problem for 
arbitrarily defined RDDs? I don't think it is solvable at this point.

That said, we can solve this for most of the built-in RDDs.




> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> reliably occur in the serialization phase at about the 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
>   
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long debugging session, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
> The (1)->(2) chain can be observed in the debug view (in Figure 2).
> {code:borderStyle=solid}
> _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
> partitionsRDD:MapPartitionsRDD -> RDDs in  the previous iterations
> {code}
> !https://raw.githubuserconten

[jira] [Created] (SPARK-4711) MLlib optimization: docs should suggest how to choose optimizer

2014-12-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-4711:


 Summary: MLlib optimization: docs should suggest how to choose 
optimizer
 Key: SPARK-4711
 URL: https://issues.apache.org/jira/browse/SPARK-4711
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Trivial


I have heard requests for the docs to include advice about choosing an 
optimization method.  The programming guide could include a brief statement 
about this (so the user does not have to read the whole optimization section).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4711) MLlib optimization: docs should suggest how to choose optimizer

2014-12-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232536#comment-14232536
 ] 

Apache Spark commented on SPARK-4711:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/3569

> MLlib optimization: docs should suggest how to choose optimizer
> ---
>
> Key: SPARK-4711
> URL: https://issues.apache.org/jira/browse/SPARK-4711
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>Priority: Trivial
>
> I have heard requests for the docs to include advice about choosing an 
> optimization method.  The programming guide could include a brief statement 
> about this (so the user does not have to read the whole optimization section).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4712) uploading jar when set spark.yarn.jar

2014-12-02 Thread Hong Shen (JIRA)
Hong Shen created SPARK-4712:


 Summary: uploading jar when set spark.yarn.jar 
 Key: SPARK-4712
 URL: https://issues.apache.org/jira/browse/SPARK-4712
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Hong Shen


when I set
spark.yarn.jar 
hdfs://mycluster/user/tdw/spark/d03/spark-assembly-1.1.0-hadoop2.2.0.jar

the Spark app will still upload the jar:
2014-12-03 10:34:41,241 INFO yarn.Client (Logging.scala:logInfo(59)) - 
Uploading 
hdfs://mycluster/user/tdw/spark/d03/spark-assembly-1.1.0-hadoop2.2.0.jar to 
hdfs://mycluster/user/yarn/.sparkStaging/application_1417501428315_1544/spark-assembly-1.1.0-hadoop2.2.0.jar




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4712) uploading jar when set spark.yarn.jar

2014-12-02 Thread Hong Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232543#comment-14232543
 ] 

Hong Shen commented on SPARK-4712:
--

The reason is that HDFS is in HA mode and the Spark client doesn't recognize 
“mycluster”: InetAddress.getByName(srcHost).getCanonicalHostName() will throw an 
UnknownHostException. 
I will add a patch to fix it.
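
A minimal sketch of the failure mode described above (assuming "mycluster" is an 
HDFS HA logical nameservice, i.e. a name that exists only in the HDFS client 
configuration and is not resolvable through DNS):

{code:borderStyle=solid}
import java.net.{InetAddress, UnknownHostException}

object HaNameserviceCheck {
  def main(args: Array[String]): Unit = {
    val srcHost = "mycluster" // HA logical nameservice, not a real host name
    try {
      // Roughly the resolution step mentioned in the comment above; DNS
      // lookup of a logical nameservice fails here.
      val canonical = InetAddress.getByName(srcHost).getCanonicalHostName
      println(s"resolved to $canonical")
    } catch {
      case e: UnknownHostException =>
        println(s"cannot resolve '$srcHost': $e")
    }
  }
}
{code}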

> uploading jar when set spark.yarn.jar 
> --
>
> Key: SPARK-4712
> URL: https://issues.apache.org/jira/browse/SPARK-4712
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Hong Shen
>
> when I set
> spark.yarn.jar 
> hdfs://mycluster/user/tdw/spark/d03/spark-assembly-1.1.0-hadoop2.2.0.jar
> the Spark app will still upload the jar:
> 2014-12-03 10:34:41,241 INFO yarn.Client (Logging.scala:logInfo(59)) - 
> Uploading 
> hdfs://mycluster/user/tdw/spark/d03/spark-assembly-1.1.0-hadoop2.2.0.jar to 
> hdfs://mycluster/user/yarn/.sparkStaging/application_1417501428315_1544/spark-assembly-1.1.0-hadoop2.2.0.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4712) uploading jar when set spark.yarn.jar

2014-12-02 Thread Hong Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen closed SPARK-4712.

Resolution: Won't Fix

> uploading jar when set spark.yarn.jar 
> --
>
> Key: SPARK-4712
> URL: https://issues.apache.org/jira/browse/SPARK-4712
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Hong Shen
>
> when I set
> spark.yarn.jar 
> hdfs://mycluster/user/tdw/spark/d03/spark-assembly-1.1.0-hadoop2.2.0.jar
> the Spark app will still upload the jar:
> 2014-12-03 10:34:41,241 INFO yarn.Client (Logging.scala:logInfo(59)) - 
> Uploading 
> hdfs://mycluster/user/tdw/spark/d03/spark-assembly-1.1.0-hadoop2.2.0.jar to 
> hdfs://mycluster/user/yarn/.sparkStaging/application_1417501428315_1544/spark-assembly-1.1.0-hadoop2.2.0.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232549#comment-14232549
 ] 

Jason Dai commented on SPARK-4672:
--

I can see two possible ways to fix this:
# Define a customized closure serialization mechanism in task serialization, 
which can use reflection to carefully choose what to serialize (i.e., only 
what is referenced by RDD.iterator); this potentially needs to deal with many 
details and can be error prone.
# In task serialization, each "base" RDD can generate a dual, "shippable" RDD, 
which has all-transient member variables and only implements the compute() 
function (which in turn calls the compute() function of the "base" RDD through 
ClosureCleaner.clean()); we can then probably rely on the Java serializer to 
handle this correctly (see the rough sketch below).
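
As a very rough, self-contained illustration of the second option (this is a 
design sketch, not existing Spark code; Base, Shippable and the demo names are 
invented): marking everything except what compute() needs as @transient lets 
plain Java serialization ship the function without dragging the base object, and 
hence the old lineage, along.

{code:borderStyle=solid}
import java.io._

// Stands in for a full RDD with a (non-serializable) lineage behind it.
class Base(val lineage: AnyRef, val f: Int => Int)

class Shippable(@transient val base: Base, f: Int => Int) extends Serializable {
  // Only what compute() needs (here: f) is kept in a non-transient field.
  // If base were not @transient, writeObject below would fail with a
  // NotSerializableException.
  def compute(data: Iterator[Int]): Iterator[Int] = data.map(f)
}

object ShippableDemo {
  def main(args: Array[String]): Unit = {
    val base = new Base(new Object, (x: Int) => x + 1)
    val shippable = new Shippable(base, base.f)

    // Round-trip through Java serialization, as a task would be.
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(shippable); out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    val shipped = in.readObject().asInstanceOf[Shippable]

    println(shipped.base)                              // null: the base and its lineage stayed behind
    println(shipped.compute(Iterator(1, 2, 3)).toList) // List(2, 3, 4)
  }
}
{code}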

> Cut off the super long serialization chain in GraphX to avoid the 
> StackOverflow error
> -
>
> Key: SPARK-4672
> URL: https://issues.apache.org/jira/browse/SPARK-4672
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.1.0
>Reporter: Lijie Xu
>Priority: Critical
> Fix For: 1.2.0
>
>
> While running iterative algorithms in GraphX, a StackOverflow error will 
> reliably occur in the serialization phase at about the 300th iteration. In general, 
> these kinds of algorithms have two things in common:
> # They have a long computing chain.
> {code:borderStyle=solid}
> (e.g., “degreeGraph=>subGraph=>degreeGraph=>subGraph=>…=>”)
> {code}
> # They will iterate many times to converge. An example:
> {code:borderStyle=solid}
> //K-Core Algorithm
> val kNum = 5
> var degreeGraph = graph.outerJoinVertices(graph.degrees) {
>   (vid, vd, degree) => degree.getOrElse(0)
> }.cache()
>   
> var isConverged = false
> do {
>   val subGraph = degreeGraph.subgraph(
>   vpred = (vid, degree) => degree >= kNum
>   ).cache()
>   val newDegreeGraph = subGraph.degrees
>   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
>   (vid, vd, degree) => degree.getOrElse(0)
>   }.cache()
>   isConverged = check(degreeGraph)
> } while(isConverged == false)
> {code}
> After about 300 iterations, StackOverflow will definitely occur with the 
> following stack trace:
> {code:borderStyle=solid}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task serialization failed: java.lang.StackOverflowError
> java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
> java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
> {code}
> It is a very tricky bug, which only occurs with enough iterations. Since it 
> took us a long time to find out its causes, we will detail the causes in the 
> following 3 paragraphs. 
>  
> h3. Phase 1: Try using checkpoint() to shorten the lineage
> It's easy to come to the thought that the long lineage may be the cause. For 
> some RDDs, their lineages may grow with the iterations. Also, for some 
> magical references,  their lineage lengths never decrease and finally become 
> very long. As a result, the call stack of task's 
> serialization()/deserialization() method will be very long too, which finally 
> exhausts the whole JVM stack.
> Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
> OneToOne dependencies in each iteration in the above example. Lineage length 
> refers to the  maximum length of OneToOne dependencies (e.g., from the 
> finalRDD to the ShuffledRDD) in each stage.
> To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
> iterations. Then, the lineage will drop down when it reaches a certain length 
> (e.g., 33). 
> However, StackOverflow error still occurs after 300+ iterations!
> h3. Phase 2: An abnormal f closure function leads to an unbreakable 
> serialization chain
> After a long debugging session, we found that an abnormal _*f*_ function closure and 
> a potential bug in GraphX (will be detailed in Phase 3) are the "Suspect 
> Zero". They together build another serialization chain that can bypass the 
> broken lineage cut by checkpoint() (as shown in Figure 1). In other words, 
> the serialization chain can be as long as the original lineage before 
> checkpoint().
> Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
> OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
> can still access the previous RDDs through the (1)->(2) reference chain. As a 
> result, the checkpoint() action is meaningless and the lineage is as long as 
> that before. 
> !

[jira] [Commented] (SPARK-3910) ./python/pyspark/mllib/classification.py doctests fails with module name pollution

2014-12-02 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232598#comment-14232598
 ] 

Yu Ishikawa commented on SPARK-3910:


I had the same problem as Tomohiko. However, I resolved it by removing all 
*.pyc files under the `python/` directory:

{noformat}
cd $SPARK_HOME && find python -name "*.pyc" -delete
{noformat}

If that is indeed the fix, then in my opinion there are two ways to resolve 
this issue:
1. remove all `*.pyc` files under the `python` directory, at least when running 
`python/run-tests`
2. resolve the cyclic import

thanks

> ./python/pyspark/mllib/classification.py doctests fails with module name 
> pollution
> --
>
> Key: SPARK-3910
> URL: https://issues.apache.org/jira/browse/SPARK-3910
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 1.2.0
> Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
> Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
> argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
> pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
> unittest2==0.5.1, wsgiref==0.1.2
>Reporter: Tomohiko K.
>  Labels: pyspark, testing
>
> In ./python/run-tests script, we run the doctests in 
> ./pyspark/mllib/classification.py.
> The output is as following:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in 
> import numpy
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py",
>  line 170, in 
> from . import add_newdocs
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py",
>  line 13, in 
> from numpy.lib import add_newdoc
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py",
>  line 8, in 
> from .type_check import *
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py",
>  line 11, in 
> import numpy.core.numeric as _nx
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py",
>  line 46, in 
> from numpy.testing import Tester
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py",
>  line 13, in 
> from .utils import *
>   File 
> "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py",
>  line 15, in 
> from tempfile import mkdtemp
>   File 
> "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py",
>  line 34, in 
> from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", 
> line 24, in 
> from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 
> 51, in 
> from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 
> 22, in 
> from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
> 0.07 real 0.04 user 0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of tempfile module.
> The cause of it is that pyspark.mllib.random module exists in the directory 
> where pyspark.mllib.classification module exists.
> classification module imports numpy module, and then numpy module imports 
> tempfile module from its inside.
> Now the first entry of sys.path is the directory "./python/pyspark/mllib" (where 
> the executed file "classification.py" lives), so the tempfile module imports the 
> pyspark.mllib.random module (not the standard library "random" module).
> Finally, import chains reach tempfile again, then a cyclic import is formed.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
> → (cyclic import!!)
> Furthermore, stat is also a standard library module, and a pyspark.mllib.stat 
> module exists. This may also be troublesome.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid using module names used by standard 
> libraries (currently "random" and "stat").
> A difficulty of this solution is to rename pyspark.mllib.random and 
> pyspark.mllib.stat, which may be already used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


