[jira] [Created] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4552:
--

 Summary: query for empty parquet table in spark sql hive get 
IllegalArgumentException
 Key: SPARK-4552
 URL: https://issues.apache.org/jira/browse/SPARK-4552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


run
create table test_parquet(key int, value string) stored as parquet;
select * from test_parquet;
and get the following error:

java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
file:/user/hive/warehouse/test_parquet
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc
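
The exception comes from ParquetTypesConverter.readMetaData, which assumes the table directory contains parquet footers; a freshly created, empty STORED AS PARQUET table has none. Purely as an illustration of the kind of guard needed (this is not the actual fix, and hasParquetFooters is a hypothetical helper name):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Returns true if the table directory holds any parquet data or metadata files.
// An empty `STORED AS PARQUET` table directory has neither.
def hasParquetFooters(tablePath: String, conf: Configuration): Boolean = {
  val path = new Path(tablePath)
  val fs = FileSystem.get(path.toUri, conf)
  fs.exists(path) && fs.listStatus(path).exists { status =>
    val name = status.getPath.getName
    name == "_metadata" || name.endsWith(".parquet")
  }
}

// A caller could skip readMetaData (which throws the IllegalArgumentException
// above) when this returns false and instead fall back to the schema recorded
// in the Hive metastore, treating the table as an empty relation.
{code}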






[jira] [Created] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4553:
--

 Summary: query for parquet table with string fields in spark sql 
hive get binary result
 Key: SPARK-4553
 URL: https://issues.apache.org/jira/browse/SPARK-4553
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


run 
create table test_parquet(key int, value string) stored as parquet;
insert into table test_parquet select * from src;
select * from test_parquet;
and get the following result:

...
282 [B@38fda3b
138 [B@1407a24
238 [B@12de6fb
419 [B@6c97695
15 [B@4885067
118 [B@156a8d3
72 [B@65d20dd
90 [B@4c18906
307 [B@60b24cc
19 [B@59cf51b
435 [B@39fdf37
10 [B@4f799d7
277 [B@3950951
273 [B@596bf4b
306 [B@3e91557
224 [B@3781d61
309 [B@2d0d128
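
The `[B@...` values above are simply `Array[Byte].toString`: the parquet reader is handing the string column back as raw binary instead of decoding it. A user-side workaround sketch until the issue is fixed (assuming a HiveContext named `hiveContext`; if your build exposes the `spark.sql.parquet.binaryAsString` option, enabling it may also be worth trying):

{code}
// Decode the binary column back into UTF-8 strings in user code.
val rows = hiveContext.sql("SELECT key, value FROM test_parquet")
rows.map { row =>
  val key = row.getInt(0)
  val value = new String(row(1).asInstanceOf[Array[Byte]], "UTF-8")
  (key, value)
}.collect().foreach(println)
{code}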






[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221889#comment-14221889
 ] 

Apache Spark commented on SPARK-4552:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3413

 query for empty parquet table in spark sql hive get IllegalArgumentException
 

 Key: SPARK-4552
 URL: https://issues.apache.org/jira/browse/SPARK-4552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 run
 create table test_parquet(key int, value string) stored as parquet;
 select * from test_parquet;
 and get the following error:
 java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
 file:/user/hive/warehouse/test_parquet
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc






[jira] [Commented] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221892#comment-14221892
 ] 

Apache Spark commented on SPARK-4553:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3414

 query for parquet table with string fields in spark sql hive get binary result
 --

 Key: SPARK-4553
 URL: https://issues.apache.org/jira/browse/SPARK-4553
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 run 
 create table test_parquet(key int, value string) stored as parquet;
 insert into table test_parquet select * from src;
 select * from test_parquet;
 and get the following result:
 ...
 282 [B@38fda3b
 138 [B@1407a24
 238 [B@12de6fb
 419 [B@6c97695
 15 [B@4885067
 118 [B@156a8d3
 72 [B@65d20dd
 90 [B@4c18906
 307 [B@60b24cc
 19 [B@59cf51b
 435 [B@39fdf37
 10 [B@4f799d7
 277 [B@3950951
 273 [B@596bf4b
 306 [B@3e91557
 224 [B@3781d61
 309 [B@2d0d128






[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib

2014-11-22 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221895#comment-14221895
 ] 

Kai Sasaki commented on SPARK-4288:
---

[~mengxr] Thank you. I'll join. 

 Add Sparse Autoencoder algorithm to MLlib 
 --

 Key: SPARK-4288
 URL: https://issues.apache.org/jira/browse/SPARK-4288
 Project: Spark
  Issue Type: Wish
  Components: MLlib
Reporter: Guoqiang Li
  Labels: features

 Are you proposing an implementation? Is it related to the neural network JIRA?






[jira] [Created] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4554:
--

 Summary: Set fair scheduler pool for JDBC client session in hive 13
 Key: SPARK-4554
 URL: https://issues.apache.org/jira/browse/SPARK-4554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


Currently the hive 13 shim does not support setting the fair scheduler pool.
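
For context, the per-session mechanism is a thread-local property: whichever thread runs a session's statements sets the pool before executing. A minimal sketch of what the hive 13 shim needs to do (illustrative only, not the patch itself; the pool name below is a made-up example):

{code}
// sc is the SparkContext backing the thrift server; run on the session's thread
val poolName = "production"                       // would come from the session conf
sc.setLocalProperty("spark.scheduler.pool", poolName)
// ... execute this JDBC session's statements on the same thread ...
sc.setLocalProperty("spark.scheduler.pool", null) // revert to the default pool
{code}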






[jira] [Commented] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221950#comment-14221950
 ] 

Apache Spark commented on SPARK-4554:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3416

 Set fair scheduler pool for JDBC client session in hive 13
 --

 Key: SPARK-4554
 URL: https://issues.apache.org/jira/browse/SPARK-4554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


 Currently the hive 13 shim does not support setting the fair scheduler pool.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024
 ] 

Debasish Das commented on SPARK-1405:
-

We need a larger dataset as well, where the number of topics goes into the range 
of 1+... That range will stress the factorization-based LSA formulations, since 
the factors are broadcast at each step... The NIPS dataset is small... Would you 
guys be willing to test a wikipedia dataset, for example? If there is a 
pre-processed version from either mahout or scikit-learn, can we use that?

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.
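
 (For readers unfamiliar with the method: a collapsed Gibbs sampler resamples each token's topic from a distribution proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + W * beta). The following is a compact, self-contained sketch of that single update, not the code from the PR.)

{code}
import scala.util.Random

/** One collapsed-Gibbs update for a single token (illustrative only).
 *  docTopic(k)   = tokens in this document assigned to topic k (current token excluded)
 *  topicWord(k)  = occurrences of this word assigned to topic k (current token excluded)
 *  topicTotal(k) = total tokens assigned to topic k (current token excluded)
 */
def sampleTopic(docTopic: Array[Int], topicWord: Array[Int], topicTotal: Array[Int],
                numWords: Int, alpha: Double, beta: Double, rng: Random): Int = {
  val weights = Array.tabulate(docTopic.length) { t =>
    (docTopic(t) + alpha) * (topicWord(t) + beta) / (topicTotal(t) + numWords * beta)
  }
  // roulette-wheel draw proportional to the unnormalized weights
  var u = rng.nextDouble() * weights.sum
  var t = 0
  while (t < weights.length - 1 && u > weights(t)) { u -= weights(t); t += 1 }
  t
}
{code}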






[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024
 ] 

Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM:
---

We need a larger dataset as well, where the number of topics goes into the range 
of 1+... That range will stress the factorization-based LSA formulations, since 
the factors are broadcast at each step... The NIPS dataset is small... Let's start 
with that... But we should test a large dataset like wikipedia as well... If there 
is a pre-processed version from either mahout or scikit-learn, we can use that.


was (Author: debasish83):
We need a larger dataset as well where topics go to the range of 1+...That 
range will stress factorization based LSA formulations since there is broadcast 
of factors at each stepNIPS dataset is small...you guy's will be willing to 
test a wikipedia dataset for example ? If there is a pre-processed version from 
either mahout or scikit-learn we can use that ?

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222027#comment-14222027
 ] 

Debasish Das commented on SPARK-1405:
-

[~pedrorodriguez] did you write the metric in your repo as well? That way I 
don't have to code it up again.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222030#comment-14222030
 ] 

Pedro Rodriguez commented on SPARK-1405:


I don't know of a larger data set, but I am working on an LDA data set 
generator based on the generative model. It should be good for benchmark 
testing but still be reasonable from the ML perspective.

The metric is in the LDA code (which is turned on and off with a flag on the 
LDA model). You can find it here in the logLikelihood function:
https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/LDA.scala
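
(For context, the metric in question is the corpus log-likelihood under the fitted model. One standard formulation is sketched below, under the assumption that theta(d)(k) and phi(k)(w) hold the estimated document-topic and topic-word distributions; this is not Pedro's actual implementation.)

{code}
/** Corpus log-likelihood: sum over tokens of log( sum_k theta_dk * phi_kw ). */
def logLikelihood(tokens: Seq[(Int, Int)],              // (docId, wordId) per token
                  theta: Array[Array[Double]],          // theta(d)(k)
                  phi: Array[Array[Double]]): Double =  // phi(k)(w)
  tokens.map { case (d, w) =>
    math.log(theta(d).indices.map(k => theta(d)(k) * phi(k)(w)).sum)
  }.sum
{code}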

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222033#comment-14222033
 ] 

Guoqiang Li commented on SPARK-1405:


OK, Where is the download URL?

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222048#comment-14222048
 ] 

Guoqiang Li commented on SPARK-1405:


Sorry, I meant the wikipedia data download URL. How much text do we need? I 
think one billion words is appropriate.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222089#comment-14222089
 ] 

Debasish Das commented on SPARK-1405:
-

The NIPS dataset is common for PLSA and additive-regularization-based matrix 
factorization formulations as well, since the experiments in this paper also 
focused on the NIPS dataset: 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

I will be using the NIPS dataset for quality experiments, but for scaling 
experiments the wiki data is good... the wiki data was demoed by Databricks at 
the last Spark Summit... it would be great if we can get it from that demo

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222105#comment-14222105
 ] 

Evan Sparks commented on SPARK-1405:


[~gq] - Those are great numbers for a very high number of topics - it's a 
little tough to follow what's leading to the super-linear scaling in #topics in 
your code, though. Are you using FastLDA or something similar to speed up 
sampling? (http://www.ics.uci.edu/~newman/pubs/fastlda.pdf)

Pedro has been testing on a wikipedia dump on s3 which I provided. It's XML 
formatted, one document per line, so it's easy to parse. I will copy this to a 
requester-pays bucket (which will be free if you run your experiments on ec2) 
now so that everyone working on this can use it for testing.

NIPS dataset seems fine for small-scale testing, but I think it's important 
that we test this implementation across a range of values for documents, words, 
topics, and tokens - hence, I think the data generator that Pedro is working on 
is a really good idea (and follows the convention of the existing data 
generators in MLlib). We'll have to be a little careful here, because some of 
the methods for making LDA fast rely on the fact that it tends to converge 
fast, and I expect that data generated by the model will be much easier to fit 
than real data.

Also, can we try and be consistent in our terminology - getting the # of unique 
words confused with all the words in a corpus is easy. I propose words and 
tokens for these two things.
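
(The generator being discussed just runs LDA's generative story forward. A dependency-free sketch of its core step follows; it is hypothetical, not the MLlib generator, and it omits the Dirichlet draws for theta and phi, which a real generator would sample from Dirichlet(alpha) and Dirichlet(beta).)

{code}
import scala.util.Random

/** Draw an index from a categorical distribution `probs` (assumed to sum to 1). */
def categorical(probs: Array[Double], rng: Random): Int = {
  var u = rng.nextDouble()
  var i = 0
  while (i < probs.length - 1 && u > probs(i)) { u -= probs(i); i += 1 }
  i
}

/** Generate `numTokens` word ids for one document with topic mixture `theta`
 *  and per-topic word distributions `phi(k)(w)`. */
def generateDoc(theta: Array[Double], phi: Array[Array[Double]],
                numTokens: Int, rng: Random): Seq[Int] =
  Seq.fill(numTokens) {
    val topic = categorical(theta, rng)
    categorical(phi(topic), rng)
  }
{code}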

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108
 ] 

Debasish Das commented on SPARK-1405:
-

@sparks that will be awesome...I should be fine running experiments on EC2...

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108
 ] 

Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM:
---

[~sparks] that will be awesome...I should be fine running experiments on EC2...


was (Author: debasish83):
@sparks that will be awesome...I should be fine running experiments on EC2...

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222112#comment-14222112
 ] 

Evan Sparks commented on SPARK-1405:


Bucket has been created: 
s3://files.sparks.requester.pays/enwiki_category_text/ - All in all there are 
181 files of ~50 MB each (roughly 10 GB in total). 

It probably makes sense to use http://sweble.org/ or something to strip the 
boilerplate, etc. from the documents for the purposes of topic modeling.
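
(A sketch of loading that dump in Spark; the scheme, path, and XML field names are assumptions to check against the actual files, and requester-pays access may need extra S3 client configuration.)

{code}
val raw = sc.textFile("s3n://files.sparks.requester.pays/enwiki_category_text/*")
val docs = raw.flatMap { line =>
  // each line is assumed to hold one <page>...</page> record
  try {
    val xml = scala.xml.XML.loadString(line)
    Some((xml \ "title").text -> (xml \\ "text").text)
  } catch {
    case _: Exception => None // skip malformed lines
  }
}
{code}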

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms in 
 MLlib, which use optimization algorithms such as gradient descent, LDA uses 
 expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.






[jira] [Created] (SPARK-4555) Add forward compatibility tests to JsonProtocol

2014-11-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4555:
-

 Summary: Add forward compatibility tests to JsonProtocol
 Key: SPARK-4555
 URL: https://issues.apache.org/jira/browse/SPARK-4555
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen


The web UI / event listener's JsonProtocol is designed to be backwards- and 
forwards-compatible: newer versions of Spark should be able to consume event 
logs written by older versions and vice-versa.

We currently have backwards-compatibility tests for the "newer version reads an 
older log" case; this JIRA tracks adding the opposite forwards-compatibility 
tests.

This type of test could be non-trivial to write, since I think we'd need to 
actually run a script against multiple compiled Spark releases, so this test 
might need to sit outside of Spark Core itself as part of an integration 
testing suite.
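
(A toy illustration of what "forward compatible" means for the JSON parsing, not Spark's actual JsonProtocol schema; the field names below are made up. The key property is that an older reader simply ignores fields that a newer writer added.)

{code}
import org.json4s._
import org.json4s.jackson.JsonMethods._

// The "older reader" only knows these two fields.
case class TaskEndLite(event: String, taskId: Long)

// Pretend this line came from a newer Spark that added an extra field.
val newerLogLine =
  """{"event":"SparkListenerTaskEnd","taskId":42,"fieldAddedInNewerSpark":true}"""

implicit val formats = DefaultFormats
val parsed = parse(newerLogLine).extract[TaskEndLite]  // extra field is ignored
assert(parsed.taskId == 42L)
{code}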






[jira] [Commented] (SPARK-4548) Python broadcast is very slow

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222154#comment-14222154
 ] 

Apache Spark commented on SPARK-4548:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3417

 Python broadcast is very slow
 -

 Key: SPARK-4548
 URL: https://issues.apache.org/jira/browse/SPARK-4548
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Davies Liu

 Python broadcast in 1.2 is much slower than 1.1: 
 In spark-perf tests:
    name                       1.1      1.2      speedup
    python-broadcast-w-set     3.63     16.68    -78.23%
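
  (Reading the table: if the 1.1 and 1.2 columns are run times in seconds for the same benchmark, then 3.63 / 16.68 - 1 = -78.2%, i.e. the broadcast path is roughly 4.6x slower in 1.2.)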






[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222172#comment-14222172
 ] 

Patrick Wendell commented on SPARK-4516:


Okay then I think this is just a documentation issue. We should add the 
documentation about direct buffers to the main configuration page and also 
mention it in the doc about network options.
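
For anyone hitting this before the documentation lands, a sketch of the knobs discussed in this thread (the property names are believed to exist in the 1.2 branch, but verify them against your build):

{code}
val conf = new org.apache.spark.SparkConf()
  // fall back to the nio transfer service, as the reporter did
  .set("spark.shuffle.blockTransferService", "nio")
  // or keep netty but steer it away from off-heap direct buffers
  .set("spark.shuffle.io.preferDirectBufs", "false")
  // and/or cap direct memory explicitly so the limit is visible
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=512m")
{code}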

 Netty off-heap memory use causes executors to be killed by OS
 -

 Key: SPARK-4516
 URL: https://issues.apache.org/jira/browse/SPARK-4516
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
 Environment: Linux, Mesos
Reporter: Hector Yee
Priority: Critical
  Labels: netty, shuffle

 The netty block transfer manager has a race condition where it closes an 
 active connection resulting in the error below. Switching to nio seems to 
 alleviate the problem.
 {code}
 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to 
 i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch 
 of 1 outstanding blocks 
 java.io.IOException: Failed to connect to 
 i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
 at 
 org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
 at 
 org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
 at 
 org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
 at 
 org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
 at 
 org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
 at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
 at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
 at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
 at 
 com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.net.ConnectException: Connection refused: 
 i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
 at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
 at 
 io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
 at 
 io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
 at 
 io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
 at 
 io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
 {code}




[jira] [Updated] (SPARK-2143) Display Spark version on Driver web page

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2143:
---
Priority: Critical  (was: Major)

 Display Spark version on Driver web page
 

 Key: SPARK-2143
 URL: https://issues.apache.org/jira/browse/SPARK-2143
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Jeff Hammerbacher
Priority: Critical








[jira] [Resolved] (SPARK-4542) Post nightly releases

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4542.

Resolution: Duplicate

 Post nightly releases
 -

 Key: SPARK-4542
 URL: https://issues.apache.org/jira/browse/SPARK-4542
 Project: Spark
  Issue Type: Improvement
Reporter: Arun Ahuja

 Spark developers are continually including new improvements and fixes to 
 sometimes critical issues. To speed up review and resolve issues faster, it 
 would help if multiple people could test (or use those fixes if they are 
 critical), which requires 1) snapshots published to maven and 2) full 
 distributions/scripts perhaps posted somewhere. Otherwise each individual 
 developer has to pull and rebuild, which may be a very long process.






[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Fix Version/s: (was: 1.2.0)

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.






[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Priority: Critical  (was: Major)

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell
Priority: Critical

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.






[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Target Version/s: 1.3.0

 Publish nightly snapshots of documentation, maven artifacts, and binary builds
 --

 Key: SPARK-1517
 URL: https://issues.apache.org/jira/browse/SPARK-1517
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Patrick Wendell

 Should be pretty easy to do with Jenkins. The only thing I can think of that 
 would be tricky is to set up credentials so that jenkins can publish this 
 stuff somewhere on apache infra.
 Ideally we don't want to have to put a private key on every jenkins box 
 (since they are otherwise pretty stateless). One idea is to encrypt these 
 credentials with a passphrase and post them somewhere publicly visible. Then 
 the jenkins build can download the credentials provided we set a passphrase 
 in an environment variable in jenkins. There may be simpler solutions as well.






[jira] [Updated] (SPARK-4507) PR merge script should support closing multiple JIRA tickets

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4507:
---
Labels: starter  (was: )

 PR merge script should support closing multiple JIRA tickets
 

 Key: SPARK-4507
 URL: https://issues.apache.org/jira/browse/SPARK-4507
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Reporter: Josh Rosen
Priority: Minor
  Labels: starter

 For pull requests that reference multiple JIRAs in their titles, it would be 
 helpful if the PR merge script offered to close all of them.






[jira] [Created] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Busbey (JIRA)
Sean Busbey created SPARK-4556:
--

 Summary: binary distribution assembly can't run in local mode
 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey


After building the binary distribution assembly, the resultant tarball can't be 
used for local mode.

{code}
busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
[INFO] Scanning for projects...
...SNIP...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s]
[INFO] Spark Project Networking ... SUCCESS [ 31.402 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 s]
[INFO] Spark Project Core . SUCCESS [15:39 min]
[INFO] Spark Project Bagel  SUCCESS [ 29.470 s]
[INFO] Spark Project GraphX ... SUCCESS [05:20 min]
[INFO] Spark Project Streaming  SUCCESS [11:02 min]
[INFO] Spark Project Catalyst . SUCCESS [11:26 min]
[INFO] Spark Project SQL .. SUCCESS [11:33 min]
[INFO] Spark Project ML Library ... SUCCESS [14:27 min]
[INFO] Spark Project Tools  SUCCESS [ 40.980 s]
[INFO] Spark Project Hive . SUCCESS [11:45 min]
[INFO] Spark Project REPL . SUCCESS [03:15 min]
[INFO] Spark Project Assembly . SUCCESS [04:22 min]
[INFO] Spark Project External Twitter . SUCCESS [ 43.567 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s]
[INFO] Spark Project External Flume ... SUCCESS [01:41 min]
[INFO] Spark Project External MQTT  SUCCESS [ 40.973 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s]
[INFO] Spark Project External Kafka ... SUCCESS [01:23 min]
[INFO] Spark Project Examples . SUCCESS [10:19 min]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 01:47 h
[INFO] Finished at: 2014-11-22T02:13:51-06:00
[INFO] Final Memory: 79M/2759M
[INFO] 
busbey2-MBA:spark busbey$ cd assembly/target/
busbey2-MBA:target busbey$ mkdir dist-temp
busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
busbey2-MBA:target busbey$ cd dist-temp/
busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
ls: 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
 No such file or directory
Failed to find Spark assembly in 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
You need to build Spark before running this program.
{code}

It looks like the classpath calculations in {{bin/compute-classpath.sh}} don't 
handle it.

If I move all of the spark-*.jar files from the top level into the lib folder 
and touch the RELEASE file, then the spark shell launches in local mode 
normally.






[jira] [Updated] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Busbey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated SPARK-4556:
---
Description: 
After building the binary distribution assembly, the resultant tarball can't be 
used for local mode.

{code}
busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
[INFO] Scanning for projects...
...SNIP...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s]
[INFO] Spark Project Networking ... SUCCESS [ 31.402 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 s]
[INFO] Spark Project Core . SUCCESS [15:39 min]
[INFO] Spark Project Bagel  SUCCESS [ 29.470 s]
[INFO] Spark Project GraphX ... SUCCESS [05:20 min]
[INFO] Spark Project Streaming  SUCCESS [11:02 min]
[INFO] Spark Project Catalyst . SUCCESS [11:26 min]
[INFO] Spark Project SQL .. SUCCESS [11:33 min]
[INFO] Spark Project ML Library ... SUCCESS [14:27 min]
[INFO] Spark Project Tools  SUCCESS [ 40.980 s]
[INFO] Spark Project Hive . SUCCESS [11:45 min]
[INFO] Spark Project REPL . SUCCESS [03:15 min]
[INFO] Spark Project Assembly . SUCCESS [04:22 min]
[INFO] Spark Project External Twitter . SUCCESS [ 43.567 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s]
[INFO] Spark Project External Flume ... SUCCESS [01:41 min]
[INFO] Spark Project External MQTT  SUCCESS [ 40.973 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s]
[INFO] Spark Project External Kafka ... SUCCESS [01:23 min]
[INFO] Spark Project Examples . SUCCESS [10:19 min]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 01:47 h
[INFO] Finished at: 2014-11-22T02:13:51-06:00
[INFO] Final Memory: 79M/2759M
[INFO] 
busbey2-MBA:spark busbey$ cd assembly/target/
busbey2-MBA:target busbey$ mkdir dist-temp
busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
busbey2-MBA:target busbey$ cd dist-temp/
busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
ls: 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
 No such file or directory
Failed to find Spark assembly in 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
You need to build Spark before running this program.
{code}

It looks like the classpath calculations in {{bin/compute-classpath.sh}} don't 
handle it.

If I move all of the spark-*.jar files from the top level into the lib folder 
and touch the RELEASE file, then the spark shell launches in local mode 
normally.

  was:
After building the binary distribution assembly, the resultant tarball can't be 
used for local mode.

{code}
busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
[INFO] Scanning for projects...
...SNIP...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s]
[INFO] Spark Project Networking ... SUCCESS [ 31.402 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 s]
[INFO] Spark Project Core . SUCCESS [15:39 min]
[INFO] Spark Project Bagel  SUCCESS [ 29.470 s]
[INFO] Spark Project GraphX ... SUCCESS [05:20 min]
[INFO] Spark Project Streaming  SUCCESS [11:02 min]
[INFO] Spark Project Catalyst . SUCCESS [11:26 min]
[INFO] Spark Project SQL .. SUCCESS [11:33 min]
[INFO] Spark Project ML Library ... SUCCESS [14:27 min]
[INFO] Spark Project Tools  SUCCESS [ 40.980 s]
[INFO] Spark Project Hive . SUCCESS [11:45 min]
[INFO] Spark Project REPL . SUCCESS [03:15 min]
[INFO] Spark Project Assembly . SUCCESS [04:22 min]
[INFO] Spark Project External Twitter . SUCCESS [ 43.567 s]
[INFO] Spark Project External Flume Sink 

[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1406#comment-1406
 ] 

Sean Owen commented on SPARK-4556:
--

Hm, but is that a bug? I think compute-classpath.sh is designed to support 
running from the project root in development, or running from the files as laid 
out in the release, at least judging from your comments and the script itself. 
I don't think the raw contents of the assembly JAR themselves are a runnable 
installation.

 binary distribution assembly can't run in local mode
 

 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey

 After building the binary distribution assembly, the resultant tarball can't 
 be used for local mode.
 {code}
 busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
 [INFO] Scanning for projects...
 ...SNIP...
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 31.402 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
 s]
 [INFO] Spark Project Core . SUCCESS [15:39 
 min]
 [INFO] Spark Project Bagel  SUCCESS [ 29.470 
 s]
 [INFO] Spark Project GraphX ... SUCCESS [05:20 
 min]
 [INFO] Spark Project Streaming  SUCCESS [11:02 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [11:26 
 min]
 [INFO] Spark Project SQL .. SUCCESS [11:33 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [14:27 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 40.980 
 s]
 [INFO] Spark Project Hive . SUCCESS [11:45 
 min]
 [INFO] Spark Project REPL . SUCCESS [03:15 
 min]
 [INFO] Spark Project Assembly . SUCCESS [04:22 
 min]
 [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
 s]
 [INFO] Spark Project External Flume ... SUCCESS [01:41 
 min]
 [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
 s]
 [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
 s]
 [INFO] Spark Project External Kafka ... SUCCESS [01:23 
 min]
 [INFO] Spark Project Examples . SUCCESS [10:19 
 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:47 h
 [INFO] Finished at: 2014-11-22T02:13:51-06:00
 [INFO] Final Memory: 79M/2759M
 [INFO] 
 
 busbey2-MBA:spark busbey$ cd assembly/target/
 busbey2-MBA:target busbey$ mkdir dist-temp
 busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
 spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
 busbey2-MBA:target busbey$ cd dist-temp/
 busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
 ls: 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
  No such file or directory
 Failed to find Spark assembly in 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
 You need to build Spark before running this program.
 {code}
 It looks like the classpath calculations in {{bin/compute-classpath.sh}} 
 don't handle it.
 If I move all of the spark-*.jar files from the top level into the lib folder 
 and touch the RELEASE file, then the spark shell launches in local mode 
 normally.






[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4377:
---
Fix Version/s: 1.3.0

 ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
 deserialize a serialized ActorRef without an ActorSystem in scope.
 -

 Key: SPARK-4377
 URL: https://issues.apache.org/jira/browse/SPARK-4377
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.3.0


 It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
 master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
 a secondary master when it takes over from a failed primary master:
 {code}
 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
 persisted file, deleting
 java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
 serialized ActorRef without an ActorSystem in scope. Use 
 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
   at 
 org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
   at 
 org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine.scala:32)
   at 
 

[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1410#comment-1410
 ] 

Sean Busbey commented on SPARK-4556:


Well, why does the layout of the binary distribution differ from the layout in 
a release?

At a minimum the README should be updated to clarify the purpose of the binary 
distribution. Preferably, the README should include instructions for taking the 
binary distribution and deploying it to be runnable.

 binary distribution assembly can't run in local mode
 

 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey

 After building the binary distribution assembly, the resultant tarball can't 
 be used for local mode.
 {code}
 busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
 [INFO] Scanning for projects...
 ...SNIP...
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 31.402 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
 s]
 [INFO] Spark Project Core . SUCCESS [15:39 
 min]
 [INFO] Spark Project Bagel  SUCCESS [ 29.470 
 s]
 [INFO] Spark Project GraphX ... SUCCESS [05:20 
 min]
 [INFO] Spark Project Streaming  SUCCESS [11:02 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [11:26 
 min]
 [INFO] Spark Project SQL .. SUCCESS [11:33 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [14:27 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 40.980 
 s]
 [INFO] Spark Project Hive . SUCCESS [11:45 
 min]
 [INFO] Spark Project REPL . SUCCESS [03:15 
 min]
 [INFO] Spark Project Assembly . SUCCESS [04:22 
 min]
 [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
 s]
 [INFO] Spark Project External Flume ... SUCCESS [01:41 
 min]
 [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
 s]
 [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
 s]
 [INFO] Spark Project External Kafka ... SUCCESS [01:23 
 min]
 [INFO] Spark Project Examples . SUCCESS [10:19 
 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:47 h
 [INFO] Finished at: 2014-11-22T02:13:51-06:00
 [INFO] Final Memory: 79M/2759M
 [INFO] 
 
 busbey2-MBA:spark busbey$ cd assembly/target/
 busbey2-MBA:target busbey$ mkdir dist-temp
 busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
 spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
 busbey2-MBA:target busbey$ cd dist-temp/
 busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
 ls: 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
  No such file or directory
 Failed to find Spark assembly in 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
 You need to build Spark before running this program.
 {code}
 It looks like the classpath calculations in {{bin/compute-classpath.sh}} 
 don't handle this layout.
 If I move all of the spark-*.jar files from the top level into the lib folder 
 and touch the RELEASE file, then the spark shell launches in local mode 
 normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1415#comment-1415
 ] 

Patrick Wendell commented on SPARK-4556:


Check out make-distribution.sh rather than using Maven directly. We might 
consider removing that Maven target since I don't think it's actively 
maintained.

 binary distribution assembly can't run in local mode
 

 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey

 After building the binary distribution assembly, the resultant tarball can't 
 be used for local mode.
 {code}
 busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
 [INFO] Scanning for projects...
 ...SNIP...
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 31.402 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
 s]
 [INFO] Spark Project Core . SUCCESS [15:39 
 min]
 [INFO] Spark Project Bagel  SUCCESS [ 29.470 
 s]
 [INFO] Spark Project GraphX ... SUCCESS [05:20 
 min]
 [INFO] Spark Project Streaming  SUCCESS [11:02 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [11:26 
 min]
 [INFO] Spark Project SQL .. SUCCESS [11:33 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [14:27 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 40.980 
 s]
 [INFO] Spark Project Hive . SUCCESS [11:45 
 min]
 [INFO] Spark Project REPL . SUCCESS [03:15 
 min]
 [INFO] Spark Project Assembly . SUCCESS [04:22 
 min]
 [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
 s]
 [INFO] Spark Project External Flume ... SUCCESS [01:41 
 min]
 [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
 s]
 [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
 s]
 [INFO] Spark Project External Kafka ... SUCCESS [01:23 
 min]
 [INFO] Spark Project Examples . SUCCESS [10:19 
 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:47 h
 [INFO] Finished at: 2014-11-22T02:13:51-06:00
 [INFO] Final Memory: 79M/2759M
 [INFO] 
 
 busbey2-MBA:spark busbey$ cd assembly/target/
 busbey2-MBA:target busbey$ mkdir dist-temp
 busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
 spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
 busbey2-MBA:target busbey$ cd dist-temp/
 busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
 ls: 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
  No such file or directory
 Failed to find Spark assembly in 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
 You need to build Spark before running this program.
 {code}
 It looks like the classpath calculations in {{bin/compute-classpath.sh}} 
 don't handle this layout.
 If I move all of the spark-*.jar files from the top level into the lib folder 
 and touch the RELEASE file, then the spark shell launches in local mode 
 normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1415#comment-1415
 ] 

Patrick Wendell edited comment on SPARK-4556 at 11/22/14 10:17 PM:
---

Check out make-distribution.sh rather than using Maven directly. We might 
consider removing that Maven target since I don't think it's actively 
maintained. We should document clearly that make-distribution.sh is the way to 
build binaries.


was (Author: pwendell):
Check out make-distribution.sh rather than using Maven directly. We might 
consider removing that Maven target since I don't think it's actively 
maintained.

 binary distribution assembly can't run in local mode
 

 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey

 After building the binary distribution assembly, the resultant tarball can't 
 be used for local mode.
 {code}
 busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
 [INFO] Scanning for projects...
 ...SNIP...
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO] 
 [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 31.402 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
 s]
 [INFO] Spark Project Core . SUCCESS [15:39 
 min]
 [INFO] Spark Project Bagel  SUCCESS [ 29.470 
 s]
 [INFO] Spark Project GraphX ... SUCCESS [05:20 
 min]
 [INFO] Spark Project Streaming  SUCCESS [11:02 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [11:26 
 min]
 [INFO] Spark Project SQL .. SUCCESS [11:33 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [14:27 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 40.980 
 s]
 [INFO] Spark Project Hive . SUCCESS [11:45 
 min]
 [INFO] Spark Project REPL . SUCCESS [03:15 
 min]
 [INFO] Spark Project Assembly . SUCCESS [04:22 
 min]
 [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
 s]
 [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
 s]
 [INFO] Spark Project External Flume ... SUCCESS [01:41 
 min]
 [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
 s]
 [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
 s]
 [INFO] Spark Project External Kafka ... SUCCESS [01:23 
 min]
 [INFO] Spark Project Examples . SUCCESS [10:19 
 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:47 h
 [INFO] Finished at: 2014-11-22T02:13:51-06:00
 [INFO] Final Memory: 79M/2759M
 [INFO] 
 
 busbey2-MBA:spark busbey$ cd assembly/target/
 busbey2-MBA:target busbey$ mkdir dist-temp
 busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
 spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
 busbey2-MBA:target busbey$ cd dist-temp/
 busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
 ls: 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
  No such file or directory
 Failed to find Spark assembly in 
 /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
 You need to build Spark before running this program.
 {code}
 It looks like the classpath calculations in {{bin/compute-classpath.sh}} 
 don't handle this layout.
 If I move all of the spark-*.jar files from the top level into the lib folder 
 and touch the RELEASE file, then the spark shell launches in local mode 
 normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4557) Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2014-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4557:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

(Don't think this is a bug, really.) Yes, it's possible VoidFunction didn't 
exist when this API was defined. It can't be changed now without breaking API 
compatibility, but AFAICT VoidFunction would be more appropriate. Maybe this can 
happen with some other related Java API rationalization in Spark 2.x.
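
To make the comparison concrete, here is a minimal Java sketch of the two shapes 
being discussed. The Function<JavaRDD<String>, Void> form reflects the existing 
1.x signature; the VoidFunction-based overload is only sketched as a local 
interface, since no such overload exists in the current API, and the class and 
method names are illustrative rather than taken from Spark:

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

public class ForeachRDDShapes {

    // What the 1.x API requires today: a Function<JavaRDD<String>, Void>,
    // which forces the caller to "return null" even though nothing is returned.
    static void withCurrentSignature(JavaDStream<String> lines) {
        lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
            @Override
            public Void call(JavaRDD<String> rdd) {
                System.out.println(rdd.count());
                return null;
            }
        });
    }

    // Hypothetical shape of the suggested overload (illustration only, not part
    // of the current API): a VoidFunction needs no return value, so a lambda
    // could simply be "rdd -> System.out.println(rdd.count())".
    interface ForeachRDDWithVoidFunction {
        void foreachRDD(VoidFunction<JavaRDD<String>> f);
    }
}
{code}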

 Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a 
 Function<..., Void>
 ---

 Key: SPARK-4557
 URL: https://issues.apache.org/jira/browse/SPARK-4557
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.1.0
Reporter: Alexis Seigneurin
Priority: Minor

 In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
 have to write:
 {code:java}
 .foreachRDD(items -> {
 ...;
 return null;
 });
 {code}
 Instead of:
 {code:java}
 .foreachRDD(items -> ...);
 {code}
 This is because the foreachRDD method accepts a Function<JavaRDD<...>, Void> 
 instead of a VoidFunction<JavaRDD<...>>. It would make sense to change it 
 to a VoidFunction as, in Spark's API, the foreach method already accepts a 
 VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1430#comment-1430
 ] 

Sean Owen commented on SPARK-4490:
--

commons-math3 is still a dependency of core, yes. Are you saying this works 
with HEAD? That would make more sense, but in general I think you would still 
want to explicitly add breeze and commons-math3 to the classpath if you want to 
use them in spark-shell, rather than rely on them being in the assembly.

 Not found RandomGenerator through spark-shell
 -

 Key: SPARK-4490
 URL: https://issues.apache.org/jira/browse/SPARK-4490
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
 Environment: spark-shell
Reporter: Kai Sasaki

 In spark-1.1.0, an exception is thrown whenever RandomGenerator from 
 commons-math3 is used. There is a workaround for this problem:
 http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
 ```
 scala> import breeze.linalg._
 import breeze.linalg._
 scala> Matrix.rand[Double](3, 3)
 java.lang.NoClassDefFoundError: 
 org/apache/commons/math3/random/RandomGenerator
 at 
 breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
 at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)
 at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
 at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21)
 at $iwC$$iwC$$iwC.<init>(<console>:23)
 at $iwC$$iwC.<init>(<console>:25)
 at $iwC.<init>(<console>:27)
 at <init>(<console>:29)
 at .<init>(<console>:33)
 at .<clinit>(<console>)
 at .<init>(<console>:7)
 at .<clinit>(<console>)
 at $print(<console>)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
 at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
 at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
 at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
 at 
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
 at 
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
 at 
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.commons.math3.random.RandomGenerator
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 44 more
 ```



--
This message was sent by 

[jira] [Created] (SPARK-4558) History Server waits ~10s before starting up

2014-11-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4558:


 Summary: History Server waits ~10s before starting up
 Key: SPARK-4558
 URL: https://issues.apache.org/jira/browse/SPARK-4558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


After you call `sbin/start-history-server.sh`, it waits about 10s before 
actually starting up. I suspect this is a subtle bug related to log checking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4530) GradientDescent gets a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.

2014-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4530:
-
Priority: Major  (was: Blocker)

See comments on the PR. I don't think these things rise to the level of 
'blocker'.

 GradientDescent gets a wrong gradient value according to the gradient formula, 
 which is caused by the miniBatchSize parameter.
 -

 Key: SPARK-4530
 URL: https://issues.apache.org/jira/browse/SPARK-4530
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0, 1.1.0, 1.2.0
Reporter: Guoqiang Li

 This bug is caused by {{RDD.sample}}: the number of elements that 
 {{RDD.sample}} returns is not fixed.
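
 To see the non-determinism concretely, here is a minimal, self-contained Java 
 sketch (the class and variable names are ours, and a local[2] master is assumed 
 purely for illustration) that counts a 50% sample of the same RDD a few times; 
 the counts usually differ from run to run, which is the behaviour described 
 above:
 {code:java}
 import java.util.ArrayList;
 import java.util.List;

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class SampleCountDemo {
     public static void main(String[] args) {
         SparkConf conf = new SparkConf().setAppName("sample-count-demo").setMaster("local[2]");
         JavaSparkContext sc = new JavaSparkContext(conf);

         List<Integer> data = new ArrayList<>();
         for (int i = 0; i < 10000; i++) {
             data.add(i);
         }
         JavaRDD<Integer> rdd = sc.parallelize(data);

         // Each call samples ~50% of the elements without replacement, but the
         // exact count is random, so these numbers usually differ between runs.
         for (int run = 0; run < 3; run++) {
             long sampled = rdd.sample(false, 0.5).count();
             System.out.println("run " + run + ": sampled " + sampled + " elements");
         }

         sc.stop();
     }
 }
 {code}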



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4489) JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException

2014-11-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1462#comment-1462
 ] 

Josh Rosen commented on SPARK-4489:
---

It looks like this is still a legitimate issue; the underlying bug is due to 
the Java API's handling of ClassTags plus incomplete test coverage for the Java 
API.  Regarding the ClassTag workaround in the gist, I think that you might be 
able to use the {{retag()}} method that I added in the fix to SPARK-1040 to 
quickly fix this.  I may be able to take a look at this reproduction later, but 
I'm going to leave this unassigned for now since it would be a great starter 
task for someone to pick up.
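
For anyone picking this up, here is a rough sketch of the retag-based workaround 
mentioned above. It assumes the recovered data is available as a JavaRDD of 
Tuple2 and that retag(Class) can be applied to it; the helper name and the 
generic cast are ours, and this has not been verified against the reproduction 
in the gist:

{code:java}
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

public class RetagWorkaroundSketch {

    // Re-applies a Tuple2 ClassTag before collecting, so that collect() builds a
    // Tuple2[] rather than an Object[] and collectAsMap() does not hit the
    // ClassCastException described above.
    @SuppressWarnings("unchecked")
    static <K, V> Map<K, V> collectAsMapRetagged(JavaRDD<Tuple2<K, V>> recovered) {
        JavaRDD<Tuple2<K, V>> retagged =
                recovered.retag((Class<Tuple2<K, V>>) (Class<?>) Tuple2.class);
        return JavaPairRDD.fromJavaRDD(retagged).collectAsMap();
    }
}
{code}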

 JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException
 -

 Key: SPARK-4489
 URL: https://issues.apache.org/jira/browse/SPARK-4489
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 1.1.0
Reporter: Christopher Ng

 Calling collectAsMap() on a JavaPairRDD reconstructed from a checkpoint fails 
 with a ClassCastException:
 Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; 
 cannot be cast to [Lscala.Tuple2;
   at 
 org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:595)
   at 
 org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:569)
   at org.facboy.spark.CheckpointBug.main(CheckpointBug.java:46)
 Code sample reproducing the issue: 
 https://gist.github.com/facboy/8387e950ffb0746a8272



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4559) Adding support for ucase and lcase

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4559:
--

 Summary: Adding support for ucase and lcase
 Key: SPARK-4559
 URL: https://issues.apache.org/jira/browse/SPARK-4559
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


Adding support for ucase and lcase in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4519) Filestream does not use hadoop configuration set within sparkContext.hadoopConfiguration

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1474#comment-1474
 ] 

Apache Spark commented on SPARK-4519:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3419

 Filestream does not use hadoop configuration set within 
 sparkContext.hadoopConfiguration
 

 Key: SPARK-4519
 URL: https://issues.apache.org/jira/browse/SPARK-4519
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4518) Filestream sometimes processes files twice

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1473#comment-1473
 ] 

Apache Spark commented on SPARK-4518:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3419

 Filestream sometimes processes files twice
 --

 Key: SPARK-4518
 URL: https://issues.apache.org/jira/browse/SPARK-4518
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2, 1.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4517) Improve memory efficiency for python broadcast

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1475#comment-1475
 ] 

Apache Spark commented on SPARK-4517:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3417

 Improve memory efficiency for python broadcast
 --

 Key: SPARK-4517
 URL: https://issues.apache.org/jira/browse/SPARK-4517
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu

 Currently, a Python broadcast (TorrentBroadcast) ends up with multiple copies:
 1) 1 copy in the Python driver
 2) 1 copy on the driver's disk (serialized and compressed)
 3) 2 copies in the JVM driver (one deserialized, one serialized and compressed)
 4) 2 copies in each executor (one deserialized, one serialized and compressed)
 5) 1 copy in each Python worker.
 Items 3) and 4) are different for HTTPBroadcast:
 3) one copy in driver memory, one copy on disk (serialized and compressed)
 4) one copy in executor memory
 If the Python broadcast is 4 GB, it needs 12 GB in the driver and (8 + 4x) GB 
 in each executor (where x is the number of Python workers, usually the number 
 of CPUs).
 The Python broadcast is already serialized and compressed in Python, so it 
 should not be serialized and compressed again in the JVM. The JVM also does not 
 need to know its content, so the data could be kept outside the JVM.
 So we should have a dedicated broadcast implementation for Python: it would 
 store the serialized and compressed data on disk, transfer it to executors in a 
 p2p way (similar to TorrentBroadcast), and send it to the Python workers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4560) Lambda deserialization error

2014-11-22 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-4560:


 Summary: Lambda deserialization error
 Key: SPARK-4560
 URL: https://issues.apache.org/jira/browse/SPARK-4560
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
 Environment: Java 8.0.25
Reporter: Alexis Seigneurin


I'm getting an error saying a lambda could not be deserialized. Here is the 
code:

{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
.map(t -> t.getText())
.foreachRDD(tweets -> {
tweets.foreach(x -> System.out.println(x));
return null;
});
{code}

Here is the exception:

{noformat}
java.io.IOException: unexpected exception type
at 
java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
at 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
... 27 more
Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
at 
com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
... 37 more
{noformat}

The weird thing is that if I write the following code (the map operation is inside 
the foreachRDD), it works without a problem.

{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
.foreachRDD(tweets -> {
tweets.map(t -> t.getText())
.foreach(x -> System.out.println(x));
return null;
});
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4560) Lambda deserialization error

2014-11-22 Thread Alexis Seigneurin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-4560:
-
Attachment: IndexTweets.java
pom.xml

I'm attaching the class I'm using and Maven's pom.xml file so that you can 
reproduce the issue.

 Lambda deserialization error
 

 Key: SPARK-4560
 URL: https://issues.apache.org/jira/browse/SPARK-4560
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
 Environment: Java 8.0.25
Reporter: Alexis Seigneurin
 Attachments: IndexTweets.java, pom.xml


 I'm getting an error saying a lambda could not be deserialized. Here is the 
 code:
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
 .map(t -> t.getText())
 .foreachRDD(tweets -> {
 tweets.foreach(x -> System.out.println(x));
 return null;
 });
 {code}
 Here is the exception:
 {noformat}
 java.io.IOException: unexpected exception type
   at 
 java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
   ... 27 more
 Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
   at 
 com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
   ... 37 more
 {noformat}
 The weird thing is that if I write the following code (the map operation is 
 inside the foreachRDD), it works without a problem.
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
 .foreachRDD(tweets -> {
 tweets.map(t -> t.getText())
 .foreach(x -> System.out.println(x));
 return 

[jira] [Commented] (SPARK-4560) Lambda deserialization error

2014-11-22 Thread Alexis Seigneurin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1482#comment-1482
 ] 

Alexis Seigneurin commented on SPARK-4560:
--

It looks like the foreach() method is causing an issue. If I replace it with a 
call to count(), it works fine:

{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
.map(t -> t.getText())
.foreachRDD(tweets -> {
System.out.println(tweets.count());
return null;
});
{code}
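
Not confirmed as the root cause in this thread, but if the failure comes from the 
inner Java 8 lambda being deserialized on the executors, one workaround worth 
trying is to write the inner function as an explicit anonymous VoidFunction, 
which is serialized as an ordinary class rather than through SerializedLambda. A 
minimal sketch, with an illustrative class and method name:

{code:java}
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

public class ForeachWorkaroundSketch {

    // Same behaviour as tweets.foreach(x -> System.out.println(x)), but the inner
    // function is an anonymous class, so no lambda deserialization is involved
    // when it is shipped to the executors.
    static void printAll(JavaDStream<String> tweets) {
        tweets.foreachRDD(rdd -> {
            rdd.foreach(new VoidFunction<String>() {
                @Override
                public void call(String x) {
                    System.out.println(x);
                }
            });
            return null;
        });
    }
}
{code}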

 Lambda deserialization error
 

 Key: SPARK-4560
 URL: https://issues.apache.org/jira/browse/SPARK-4560
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
 Environment: Java 8.0.25
Reporter: Alexis Seigneurin
 Attachments: IndexTweets.java, pom.xml


 I'm getting an error saying a lambda could not be deserialized. Here is the 
 code:
 {code}
 TwitterUtils.createStream(sc, twitterAuth, filters)
 .map(t -> t.getText())
 .foreachRDD(tweets -> {
 tweets.foreach(x -> System.out.println(x));
 return null;
 });
 {code}
 Here is the exception:
 {noformat}
 java.io.IOException: unexpected exception type
   at 
 java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
   at org.apache.spark.scheduler.Task.run(Task.scala:54)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:483)
   at 
 java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
   ... 27 more
 Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
   at 
 com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
   ... 37 more
 {noformat}
 The weird thing is, if I write the following code (the map operation is 
 inside the foreachRDD), it 

[jira] [Resolved] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4377.

Resolution: Fixed

 ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
 deserialize a serialized ActorRef without an ActorSystem in scope.
 -

 Key: SPARK-4377
 URL: https://issues.apache.org/jira/browse/SPARK-4377
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.3.0
Reporter: Josh Rosen
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.3.0


 It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
 master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
 a secondary master when it takes over from a failed primary master:
 {code}
 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
 persisted file, deleting
 java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
 serialized ActorRef without an ActorSystem in scope. Use 
 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
   at 
 org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
   at 
 org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine.scala:32)
   at 
 

[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4377:
---
Target Version/s:   (was: 1.2.0)

 ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
 deserialize a serialized ActorRef without an ActorSystem in scope.
 -

 Key: SPARK-4377
 URL: https://issues.apache.org/jira/browse/SPARK-4377
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.3.0
Reporter: Josh Rosen
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.3.0


 It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
 master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
 a secondary master when it takes over from a failed primary master:
 {code}
 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
 persisted file, deleting
 java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
 serialized ActorRef without an ActorSystem in scope. Use 
 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
   at 
 org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
   at 
 org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine.scala:32)
   at 
 

[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4377:
---
Affects Version/s: (was: 1.2.0)
   1.3.0

 ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
 deserialize a serialized ActorRef without an ActorSystem in scope.
 -

 Key: SPARK-4377
 URL: https://issues.apache.org/jira/browse/SPARK-4377
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.3.0
Reporter: Josh Rosen
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.3.0


 It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
 master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
 a secondary master when it takes over from a failed primary master:
 {code}
 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
 ConnectionStateListeners registered.
 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
 cores, 984.0 MB RAM
 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
 persisted file, deleting
 java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
 serialized ActorRef without an ActorSystem in scope. Use 
 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
   at 
 org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
   at 
 org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
   at 
 org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine.scala:32)
   at 
 

[jira] [Created] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4561:
-

 Summary: PySparkSQL's Row.asDict() should convert nested rows to 
dictionaries
 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen


In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Description: 
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}


Actually, it looks like the nested fields are just left as Rows (IPython's 
fancy display logic obscured this in my first example):

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.

  was:
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.


 PySparkSQL's Row.asDict() should convert nested rows to dictionaries
 

 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen

 In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
 dictionary.  Unfortunately, though, this does not convert nested rows to 
 dictionaries.  For example:
 {code}
 >>> sqlContext.sql("select results from results").first()
 Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
 Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
 Row(time=3.276), Row(time=3.239), Row(time=3.149)])
 >>> sqlContext.sql("select results from results").first().asDict()
 {u'results': [(3.762,),
   (3.47,),
   (3.559,),
   (3.458,),
   (3.229,),
   (3.21,),
   (3.166,),
   (3.276,),
   (3.239,),
   (3.149,)]}
 {code}
 Actually, it looks like the nested fields are just left as Rows (IPython's 
 fancy display logic obscured this in my first example):
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [Row(time=1), Row(time=2)]}
 {code}
 I ran into this issue when trying to use Pandas dataframes to display nested 
 data that I queried from Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Description: 
In PySpark, you can call {{.asDict
()}} on a SparkSQL {{Row}} to convert it to a dictionary.  Unfortunately, 
though, this does not convert nested rows to dictionaries.  For example:

{code}
 sqlContext.sql(select results from results).first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}


Actually, it looks like the nested fields are just left as Rows (IPython's 
fancy display logic obscured this in my first example):

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}

Here's the output I'd expect:

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [{'time': 1}, {'time': 2}]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.
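
For reference, a minimal sketch of that Pandas use case under the current 
behaviour, where each inner Row has to be converted individually because 
{{asDict()}} does not recurse (the variable names here are illustrative only):

{code}
import pandas as pd
from pyspark.sql import Row

row = Row(results=[Row(time=1), Row(time=2)])
# Convert each nested Row by hand before building the DataFrame.
df = pd.DataFrame([r.asDict() for r in row.results])
print(df)   # a single `time` column with values 1 and 2
{code}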

  was:
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to 
a dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}


Actually, it looks like the nested fields are just left as Rows (IPython's 
fancy display logic obscured this in my first example):

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.


 PySparkSQL's Row.asDict() should convert nested rows to dictionaries
 

 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen

 In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it 
 to a dictionary.  Unfortunately, though, this does not convert nested rows 
 to dictionaries.  For example:
 {code}
 >>> sqlContext.sql("select results from results").first()
 Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
 Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
 Row(time=3.276), Row(time=3.239), Row(time=3.149)])
 >>> sqlContext.sql("select results from results").first().asDict()
 {u'results': [(3.762,),
   (3.47,),
   (3.559,),
   (3.458,),
   (3.229,),
   (3.21,),
   (3.166,),
   (3.276,),
   (3.239,),
   (3.149,)]}
 {code}
 Actually, it looks like the nested fields are just left as Rows (IPython's 
 fancy display logic obscured this in my first example):
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [Row(time=1), Row(time=2)]}
 {code}
 Here's the output I'd expect:
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [{'time': 1}, {'time': 2}]}
 {code}
 I ran into this issue when trying to use Pandas dataframes to display nested 
 data that I queried from Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Target Version/s: 1.2.0
Assignee: Davies Liu

[~davies], could you take a look at this since you're more familiar with this 
code than I am?  It might be nice to squeeze a fix for this into 1.2.0 before 
this API becomes stable.

I noticed that there are two {{asDict()}} methods, one in each {{Row}} class; 
is there a way to avoid this duplication?  Also, could we maybe add some 
user-facing doctests to this, e.g.:

{code}
def asDict(self):
    """
    Return this row as a dictionary.

    >>> Row(name='Alice', age=11).asDict()
    {'age': 11, 'name': 'Alice'}

    Nested rows will be converted into nested dictionaries:

    >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
    {'results': [{'time': 1}, {'time': 2}]}
    """
{code}

 PySparkSQL's Row.asDict() should convert nested rows to dictionaries
 

 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Davies Liu

 In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it 
 to a dictionary.  Unfortunately, though, this does not convert nested rows 
 to dictionaries.  For example:
 {code}
 >>> sqlContext.sql("select results from results").first()
 Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
 Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
 Row(time=3.276), Row(time=3.239), Row(time=3.149)])
 >>> sqlContext.sql("select results from results").first().asDict()
 {u'results': [(3.762,),
   (3.47,),
   (3.559,),
   (3.458,),
   (3.229,),
   (3.21,),
   (3.166,),
   (3.276,),
   (3.239,),
   (3.149,)]}
 {code}
 Actually, it looks like the nested fields are just left as Rows (IPython's 
 fancy display logic obscured this in my first example):
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [Row(time=1), Row(time=2)]}
 {code}
 Here's the output I'd expect:
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [{'time': 1}, {'time': 2}]}
 {code}
 I ran into this issue when trying to use Pandas dataframes to display nested 
 data that I queried from Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222310#comment-14222310
 ] 

Davies Liu commented on SPARK-4561:
---

I tried to do it, but found that it's not easy, because Row() could be nested 
inside MapType and ArrayType (even UDT), and it could also be expensive.

Maybe we should make it optional, using recursive=True?
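
A rough sketch of what such an opt-in conversion could look like (the 
standalone as_dict helper and its handling of lists and dicts for 
ArrayType/MapType values are assumptions for illustration, not the actual 
implementation; UDTs are not covered):

{code}
from pyspark.sql import Row

def as_dict(row, recursive=False):
    # Shallow conversion, as asDict() behaves today.
    d = row.asDict()
    if not recursive:
        return d

    def convert(value):
        # Recurse into nested Rows and into the containers they may hide in
        # (ArrayType -> list, MapType -> dict).
        if isinstance(value, Row):
            return as_dict(value, recursive=True)
        if isinstance(value, list):
            return [convert(v) for v in value]
        if isinstance(value, dict):
            return dict((k, convert(v)) for k, v in value.items())
        return value

    return dict((k, convert(v)) for k, v in d.items())

# as_dict(Row(results=[Row(time=1), Row(time=2)]), recursive=True)
# -> {'results': [{'time': 1}, {'time': 2}]}
{code}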

 PySparkSQL's Row.asDict() should convert nested rows to dictionaries
 

 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Davies Liu

 In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it 
 to a dictionary.  Unfortunately, though, this does not convert nested rows 
 to dictionaries.  For example:
 {code}
 >>> sqlContext.sql("select results from results").first()
 Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
 Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
 Row(time=3.276), Row(time=3.239), Row(time=3.149)])
 >>> sqlContext.sql("select results from results").first().asDict()
 {u'results': [(3.762,),
   (3.47,),
   (3.559,),
   (3.458,),
   (3.229,),
   (3.21,),
   (3.166,),
   (3.276,),
   (3.239,),
   (3.149,)]}
 {code}
 Actually, it looks like the nested fields are just left as Rows (IPython's 
 fancy display logic obscured this in my first example):
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [Row(time=1), Row(time=2)]}
 {code}
 Here's the output I'd expect:
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [{'time': 1}, {'time': 2}]}
 {code}
 I ran into this issue when trying to use Pandas dataframes to display nested 
 data that I queried from Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Target Version/s: 1.3.0  (was: 1.2.0)

Good point; if we add a {{recursive}} option and have recursion off by default, 
then it's not urgent to fix this now since the new option will be 
backwards-compatible with what we ship in 1.2.0.

 PySparkSQL's Row.asDict() should convert nested rows to dictionaries
 

 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen
Assignee: Davies Liu

 In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it 
 to a dictionary.  Unfortunately, though, this does not convert nested rows 
 to dictionaries.  For example:
 {code}
 >>> sqlContext.sql("select results from results").first()
 Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
 Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
 Row(time=3.276), Row(time=3.239), Row(time=3.149)])
 >>> sqlContext.sql("select results from results").first().asDict()
 {u'results': [(3.762,),
   (3.47,),
   (3.559,),
   (3.458,),
   (3.229,),
   (3.21,),
   (3.166,),
   (3.276,),
   (3.239,),
   (3.149,)]}
 {code}
 Actually, it looks like the nested fields are just left as Rows (IPython's 
 fancy display logic obscured this in my first example):
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [Row(time=1), Row(time=2)]}
 {code}
 Here's the output I'd expect:
 {code}
 >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
 {'results': [{'time': 1}, {'time': 2}]}
 {code}
 I ran into this issue when trying to use Pandas dataframes to display nested 
 data that I queried from Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4417) New API: sample RDD to fixed number of items

2014-11-22 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222332#comment-14222332
 ] 

Sandeep Singh commented on SPARK-4417:
--

Can you assign this to me?

 New API: sample RDD to fixed number of items
 

 Key: SPARK-4417
 URL: https://issues.apache.org/jira/browse/SPARK-4417
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Spark Core
Reporter: Davies Liu

 Sometimes we just want a fixed number of items randomly selected from an 
 RDD; for example, before sorting an RDD we need to gather a fixed number of 
 keys from each partition.
 In order to do this, we currently need two passes over the RDD: get the 
 total count, then calculate the right sampling ratio. In fact, we could do 
 this in one pass.
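
One possible one-pass approach (a sketch only; the sample_fixed helper and 
its API are made up for illustration, not the proposed Spark API) is to key 
every element with an independent uniform random number and keep the k 
smallest keys:

{code}
import random

def sample_fixed(rdd, k, seed=42):
    """Return k items sampled uniformly without replacement, in one pass."""
    def tag(split_index, iterator):
        # Seed per partition so the keys are reproducible but independent.
        rng = random.Random(seed + split_index)
        for item in iterator:
            yield (rng.random(), item)

    # The k elements with the smallest random keys form a uniform sample.
    keyed = rdd.mapPartitionsWithIndex(tag)
    return [item for _, item in keyed.takeOrdered(k, key=lambda kv: kv[0])]
{code}

An alternative is a per-partition reservoir plus a weighted merge on the 
driver, which avoids the top-k selection but is trickier to keep exactly 
uniform.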



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-22 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/23/14 6:55 AM:
---

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: Ok, here's the document:
https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing


was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: On second thought, I'd actually like to make a few changes to the 
proposal. I'll follow up tomorrow with the updated version. Sorry for the 
confusion.

 Implement Nesterov's accelerated first-order method
 ---

 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Aaron Staple

 Nesterov's accelerated first-order method is a drop-in replacement for 
 steepest descent but it converges much faster. We should implement this 
 method and compare its performance with existing algorithms, including SGD 
 and L-BFGS.
 TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
 method and its variants on composite objectives.
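
For reference, a minimal NumPy sketch of the accelerated update rule on a 
smooth convex objective (an illustration only, not MLlib code; the example 
problem and step size below are made up, and the step should be at most 1/L 
for an L-Lipschitz gradient):

{code}
import numpy as np

def nesterov(grad, x0, step, iters=500):
    """Nesterov/FISTA-style accelerated gradient on a smooth convex objective."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_next = y - step * grad(y)                       # gradient step at the lookahead point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

# Example: least squares, minimize 0.5 * ||A x - b||^2, gradient A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
x_opt = nesterov(lambda x: A.T.dot(A.dot(x) - b), np.zeros(2), step=0.05)
{code}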



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org