[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-11-22 Thread Aaron Staple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221233#comment-14221233
 ] 

Aaron Staple edited comment on SPARK-1503 at 11/23/14 6:55 AM:
---

[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: Ok, here's the document:
https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing


was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial 
implementation. Let me know what you think, and if you'd like me to clarify 
anything.

UPDATE: On second thought, I'd actually like to make a few changes to the 
proposal. I'll follow up tomorrow with the updated version. Sorry for the 
confusion.

> Implement Nesterov's accelerated first-order method
> ---
>
> Key: SPARK-1503
> URL: https://issues.apache.org/jira/browse/SPARK-1503
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for 
> steepest descent but it converges much faster. We should implement this 
> method and compare its performance with existing algorithms, including SGD 
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
> method and its variants on composite objectives.
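For reference, a minimal NumPy sketch of the accelerated (FISTA-style) update scheme described above; this is an illustration only, not MLlib code, and the example objective, its gradient, and the Lipschitz constant L are assumptions for the sketch:

{code}
import numpy as np

def nesterov(grad_f, x0, L, iters=100):
    """Accelerated gradient descent for a smooth convex f with L-Lipschitz gradient."""
    x = y = np.asarray(x0, dtype=float)
    t = 1.0
    for _ in range(iters):
        x_next = y - grad_f(y) / L                         # gradient step at the lookahead point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum extrapolation
        x, t = x_next, t_next
    return x

# Example: minimize 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b) and
# whose Lipschitz constant is the largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
L = np.linalg.eigvalsh(A.T.dot(A)).max()
x_star = nesterov(lambda x: A.T.dot(A.dot(x) - b), np.zeros(2), L)
{code}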






[jira] [Commented] (SPARK-4417) New API: sample RDD to fixed number of items

2014-11-22 Thread Sandeep Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222332#comment-14222332
 ] 

Sandeep Singh commented on SPARK-4417:
--

Can you assign this to me?

> New API: sample RDD to fixed number of items
> 
>
> Key: SPARK-4417
> URL: https://issues.apache.org/jira/browse/SPARK-4417
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Reporter: Davies Liu
>
> Sometimes we just want a fixed number of items randomly selected from an 
> RDD; for example, before sorting an RDD we need to gather a fixed number of 
> keys from each partition.
> Currently this requires two passes over the RDD: get the total count, then 
> calculate the right sampling ratio. In fact, we could do this in one pass.
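For reference, a minimal PySpark sketch of the two-pass approach described above (the RDD contents and the target size {{k}} are made-up example values; {{sample()}} is approximate, so the sketch over-samples slightly and trims):

{code}
# Two-pass fixed-size sampling: count first, then sample by ratio.
from pyspark import SparkContext

sc = SparkContext(appName="fixed-size-sample-sketch")
rdd = sc.parallelize(range(100000))
k = 1000                                      # desired sample size

total = rdd.count()                           # pass 1: total number of items
fraction = min(1.0, 1.2 * k / float(total))   # over-sample ~20% for safety
sample = rdd.sample(False, fraction).take(k)  # pass 2: sample, then trim to k
{code}

(The existing {{RDD.takeSample}} API also returns a fixed number of items, but it likewise counts the RDD first, which is the extra pass the proposal wants to avoid.)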






[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Target Version/s: 1.3.0  (was: 1.2.0)

Good point; if we add a {{recursive}} option and have recursion off by default, 
then it's not urgent to fix this now since the new option will be 
backwards-compatible with what we ship in 1.2.0.

> PySparkSQL's Row.asDict() should convert nested rows to dictionaries
> 
>
> Key: SPARK-4561
> URL: https://issues.apache.org/jira/browse/SPARK-4561
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>
> In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to 
> a dictionary.  Unfortunately, though, this does not convert nested rows to 
> dictionaries.  For example:
> {code}
> >>> sqlContext.sql("select results from results").first()
> Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
> Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
> Row(time=3.276), Row(time=3.239), Row(time=3.149)])
> >>> sqlContext.sql("select results from results").first().asDict()
> {u'results': [(3.762,),
>   (3.47,),
>   (3.559,),
>   (3.458,),
>   (3.229,),
>   (3.21,),
>   (3.166,),
>   (3.276,),
>   (3.239,),
>   (3.149,)]}
> {code}
> Actually, it looks like the nested fields are just left as Rows (IPython's 
> fancy display logic obscured this in my first example):
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [Row(time=1), Row(time=2)]}
> {code}
> Here's the output I'd expect:
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [{'time': 1}, {'time': 2}]}
> {code}
> I ran into this issue when trying to use Pandas dataframes to display nested 
> data that I queried from Spark SQL.






[jira] [Commented] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222310#comment-14222310
 ] 

Davies Liu commented on SPARK-4561:
---

I tried to do it, but found that it's not easy, because Row() can be nested 
inside MapType and ArrayType (and even UDT) values, and the conversion could 
also be expensive.

Maybe we should make it optional, using recursive=True?
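For illustration, a rough standalone sketch of what such an optional recursive conversion could look like (not the actual PySpark patch; the helper name is made up, and it assumes nested values are Rows, lists/tuples, or dicts):

{code}
def as_dict(row, recursive=False):
    """Convert a pyspark.sql.Row to a dict, optionally converting nested Rows."""
    def convert(obj):
        if hasattr(obj, "asDict"):            # a nested Row
            return as_dict(obj, recursive=True)
        if isinstance(obj, (list, tuple)):    # ArrayType values
            return [convert(x) for x in obj]
        if isinstance(obj, dict):             # MapType values
            return dict((k, convert(v)) for k, v in obj.items())
        return obj

    d = row.asDict()
    if not recursive:
        return d
    return dict((k, convert(v)) for k, v in d.items())
{code}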

> PySparkSQL's Row.asDict() should convert nested rows to dictionaries
> 
>
> Key: SPARK-4561
> URL: https://issues.apache.org/jira/browse/SPARK-4561
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>
> In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to 
> a dictionary.  Unfortunately, though, this does not convert nested rows to 
> dictionaries.  For example:
> {code}
> >>> sqlContext.sql("select results from results").first()
> Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
> Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
> Row(time=3.276), Row(time=3.239), Row(time=3.149)])
> >>> sqlContext.sql("select results from results").first().asDict()
> {u'results': [(3.762,),
>   (3.47,),
>   (3.559,),
>   (3.458,),
>   (3.229,),
>   (3.21,),
>   (3.166,),
>   (3.276,),
>   (3.239,),
>   (3.149,)]}
> {code}
> Actually, it looks like the nested fields are just left as Rows (IPython's 
> fancy display logic obscured this in my first example):
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [Row(time=1), Row(time=2)]}
> {code}
> Here's the output I'd expect:
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [{'time': 1}, {'time': 2}]}
> {code}
> I ran into this issue when trying to use Pandas dataframes to display nested 
> data that I queried from Spark SQL.






[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Target Version/s: 1.2.0
Assignee: Davies Liu

[~davies], could you take a look at this, since you're more familiar with this 
code than I am?  It might be nice to squeeze a fix for this into 1.2.0 before 
this API becomes stable.

I noticed that there are two {{asDict()}} methods, one in each {{Row}} class; is 
there a way to avoid this duplication?  Also, could we maybe add some 
user-facing doctests to this, e.g.

{code}
def asDict(self):
    """
    Return this row as a dictionary.

    >>> Row(name='Alice', age=11).asDict()
    {'age': 11, 'name': 'Alice'}

    Nested rows will be converted into nested dictionaries:

    >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
    {'results': [{'time': 1}, {'time': 2}]}
    """
{code}

> PySparkSQL's Row.asDict() should convert nested rows to dictionaries
> 
>
> Key: SPARK-4561
> URL: https://issues.apache.org/jira/browse/SPARK-4561
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Davies Liu
>
> In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to 
> a dictionary.  Unfortunately, though, this does not convert nested rows to 
> dictionaries.  For example:
> {code}
> >>> sqlContext.sql("select results from results").first()
> Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
> Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
> Row(time=3.276), Row(time=3.239), Row(time=3.149)])
> >>> sqlContext.sql("select results from results").first().asDict()
> {u'results': [(3.762,),
>   (3.47,),
>   (3.559,),
>   (3.458,),
>   (3.229,),
>   (3.21,),
>   (3.166,),
>   (3.276,),
>   (3.239,),
>   (3.149,)]}
> {code}
> Actually, it looks like the nested fields are just left as Rows (IPython's 
> fancy display logic obscured this in my first example):
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [Row(time=1), Row(time=2)]}
> {code}
> Here's the output I'd expect:
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [{'time': 1}, {'time': 2}]}
> {code}
> I ran into this issue when trying to use Pandas dataframes to display nested 
> data that I queried from Spark SQL.






[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Description: 
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}


Actually, it looks like the nested fields are just left as Rows (IPython's 
fancy display logic obscured this in my first example):

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}

Here's the output I'd expect:

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [{'time': 1}, {'time': 2}]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.

  was:
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}


Actually, it looks like the nested fields are just left as Rows (IPython's 
fancy display logic obscured this in my first example):

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.


> PySparkSQL's Row.asDict() should convert nested rows to dictionaries
> 
>
> Key: SPARK-4561
> URL: https://issues.apache.org/jira/browse/SPARK-4561
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>
> In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to 
> a dictionary.  Unfortunately, though, this does not convert nested rows to 
> dictionaries.  For example:
> {code}
> >>> sqlContext.sql("select results from results").first()
> Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
> Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
> Row(time=3.276), Row(time=3.239), Row(time=3.149)])
> >>> sqlContext.sql("select results from results").first().asDict()
> {u'results': [(3.762,),
>   (3.47,),
>   (3.559,),
>   (3.458,),
>   (3.229,),
>   (3.21,),
>   (3.166,),
>   (3.276,),
>   (3.239,),
>   (3.149,)]}
> {code}
> Actually, it looks like the nested fields are just left as Rows (IPython's 
> fancy display logic obscured this in my first example):
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [Row(time=1), Row(time=2)]}
> {code}
> Here's the output I'd expect:
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [{'time': 1}, {'time': 2}]}
> {code}
> I ran into this issue when trying to use Pandas dataframes to display nested 
> data that I queried from Spark SQL.






[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-4561:
--
Description: 
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}


Actually, it looks like the nested fields are just left as Rows (IPython's 
fancy display logic obscured this in my first example):

{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.

  was:
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.


> PySparkSQL's Row.asDict() should convert nested rows to dictionaries
> 
>
> Key: SPARK-4561
> URL: https://issues.apache.org/jira/browse/SPARK-4561
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>
> In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to 
> a dictionary.  Unfortunately, though, this does not convert nested rows to 
> dictionaries.  For example:
> {code}
> >>> sqlContext.sql("select results from results").first()
> Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), 
> Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), 
> Row(time=3.276), Row(time=3.239), Row(time=3.149)])
> >>> sqlContext.sql("select results from results").first().asDict()
> {u'results': [(3.762,),
>   (3.47,),
>   (3.559,),
>   (3.458,),
>   (3.229,),
>   (3.21,),
>   (3.166,),
>   (3.276,),
>   (3.239,),
>   (3.149,)]}
> {code}
> Actually, it looks like the nested fields are just left as Rows (IPython's 
> fancy display logic obscured this in my first example):
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [Row(time=1), Row(time=2)]}
> {code}
> I ran into this issue when trying to use Pandas dataframes to display nested 
> data that I queried from Spark SQL.






[jira] [Created] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries

2014-11-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4561:
-

 Summary: PySparkSQL's Row.asDict() should convert nested rows to 
dictionaries
 Key: SPARK-4561
 URL: https://issues.apache.org/jira/browse/SPARK-4561
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen


In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a 
dictionary.  Unfortunately, though, this does not convert nested rows to 
dictionaries.  For example:

{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), 
Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), 
Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,),
  (3.47,),
  (3.559,),
  (3.458,),
  (3.229,),
  (3.21,),
  (3.166,),
  (3.276,),
  (3.239,),
  (3.149,)]}
{code}

I ran into this issue when trying to use Pandas dataframes to display nested 
data that I queried from Spark SQL.






[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4377:
---
Affects Version/s: (was: 1.2.0)
   1.3.0

> ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
> deserialize a serialized ActorRef without an ActorSystem in scope.
> -
>
> Key: SPARK-4377
> URL: https://issues.apache.org/jira/browse/SPARK-4377
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Prashant Sharma
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
> master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
> a secondary master when it takes over from a failed primary master:
> {code}
> 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
> ConnectionStateListeners registered.
> 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
> 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
> persisted file, deleting
> java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
> serialized ActorRef without an ActorSystem in scope. Use 
> 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
>   at 
> org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
>   at 
> org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.

[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4377:
---
Target Version/s:   (was: 1.2.0)

> ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
> deserialize a serialized ActorRef without an ActorSystem in scope.
> -
>
> Key: SPARK-4377
> URL: https://issues.apache.org/jira/browse/SPARK-4377
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Prashant Sharma
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
> master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
> a secondary master when it takes over from a failed primary master:
> {code}
> 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
> ConnectionStateListeners registered.
> 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
> 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
> persisted file, deleting
> java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
> serialized ActorRef without an ActorSystem in scope. Use 
> 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
>   at 
> org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
>   at 
> org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersi

[jira] [Resolved] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4377.

Resolution: Fixed

> ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
> deserialize a serialized ActorRef without an ActorSystem in scope.
> -
>
> Key: SPARK-4377
> URL: https://issues.apache.org/jira/browse/SPARK-4377
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.3.0
>Reporter: Josh Rosen
>Assignee: Prashant Sharma
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
> master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
> a secondary master when it takes over from a failed primary master:
> {code}
> 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
> ConnectionStateListeners registered.
> 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
> 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
> persisted file, deleting
> java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
> serialized ActorRef without an ActorSystem in scope. Use 
> 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
>   at 
> org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
>   at 
> org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine.

[jira] [Commented] (SPARK-4560) Lambda deserialization error

2014-11-22 Thread Alexis Seigneurin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1482#comment-1482
 ] 

Alexis Seigneurin commented on SPARK-4560:
--

It looks like the foreach() method is causing the issue. If I replace it with a 
call to count(), it works fine:

{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
        .map(t -> t.getText())
        .foreachRDD(tweets -> {
            System.out.println(tweets.count());
            return null;
        });
{code}

> Lambda deserialization error
> 
>
> Key: SPARK-4560
> URL: https://issues.apache.org/jira/browse/SPARK-4560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: Java 8.0.25
>Reporter: Alexis Seigneurin
> Attachments: IndexTweets.java, pom.xml
>
>
> I'm getting an error saying a lambda could not be deserialized. Here is the 
> code:
> {code}
> TwitterUtils.createStream(sc, twitterAuth, filters)
> .map(t -> t.getText())
> .foreachRDD(tweets -> {
> tweets.foreach(x -> System.out.println(x));
> return null;
> });
> {code}
> Here is the exception:
> {noformat}
> java.io.IOException: unexpected exception type
>   at 
> java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>   ... 27 more
> Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
>   at 
> com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
>   ... 37 more
> {noformat}
> T

[jira] [Updated] (SPARK-4560) Lambda deserialization error

2014-11-22 Thread Alexis Seigneurin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-4560:
-
Attachment: IndexTweets.java
pom.xml

I'm attaching the class I'm using and Maven's pom.xml file so that you can 
reproduce the issue.

> Lambda deserialization error
> 
>
> Key: SPARK-4560
> URL: https://issues.apache.org/jira/browse/SPARK-4560
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.1.0
> Environment: Java 8.0.25
>Reporter: Alexis Seigneurin
> Attachments: IndexTweets.java, pom.xml
>
>
> I'm getting an error saying a lambda could not be deserialized. Here is the 
> code:
> {code}
> TwitterUtils.createStream(sc, twitterAuth, filters)
> .map(t -> t.getText())
> .foreachRDD(tweets -> {
> tweets.foreach(x -> System.out.println(x));
> return null;
> });
> {code}
> Here is the exception:
> {noformat}
> java.io.IOException: unexpected exception type
>   at 
> java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
>   at org.apache.spark.scheduler.Task.run(Task.scala:54)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
>   ... 27 more
> Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
>   at 
> com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
>   ... 37 more
> {noformat}
> The weird thing is, if I write the following code (the map operation is 
> inside the foreachRDD), it works without a problem.
> {code}
> TwitterUtils.createStream(sc, twitterAuth, filters)
> .foreachRDD(tweets -> {
> tweets.map(t -> t.g

[jira] [Created] (SPARK-4560) Lambda deserialization error

2014-11-22 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-4560:


 Summary: Lambda deserialization error
 Key: SPARK-4560
 URL: https://issues.apache.org/jira/browse/SPARK-4560
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
 Environment: Java 8.0.25
Reporter: Alexis Seigneurin


I'm getting an error saying a lambda could not be deserialized. Here is the 
code:

{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
        .map(t -> t.getText())
        .foreachRDD(tweets -> {
            tweets.foreach(x -> System.out.println(x));
            return null;
        });
{code}

Here is the exception:

{noformat}
java.io.IOException: unexpected exception type
at 
java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
at 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104)
... 27 more
Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization
at 
com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1)
... 37 more
{noformat}

The weird thing is, if I write the following code (the map operation is inside 
the foreachRDD), it works without a problem.

{code}
TwitterUtils.createStream(sc, twitterAuth, filters)
        .foreachRDD(tweets -> {
            tweets.map(t -> t.getText())
                  .foreach(x -> System.out.println(x));
            return null;
        });
{code}






[jira] [Commented] (SPARK-4517) Improve memory efficiency for python broadcast

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1475#comment-1475
 ] 

Apache Spark commented on SPARK-4517:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3417

> Improve memory efficiency for python broadcast
> --
>
> Key: SPARK-4517
> URL: https://issues.apache.org/jira/browse/SPARK-4517
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> Currently, a Python broadcast (TorrentBroadcast) has multiple copies:
> 1) 1 copy in the Python driver
> 2) 1 copy on the driver's disk (serialized and compressed)
> 3) 2 copies in the JVM driver (one deserialized, one serialized and 
> compressed)
> 4) 2 copies in each executor (one deserialized, one serialized and 
> compressed)
> 5) 1 copy in each Python worker.
> Some of these differ for HTTPBroadcast:
> 3) one copy in driver memory, one copy on disk (serialized and compressed)
> 4) one copy in executor memory
> If the Python broadcast is 4 GB, it needs 12 GB in the driver and 8+4x GB in 
> each executor (where x is the number of Python workers, usually the number of 
> CPUs).
> The Python broadcast is already serialized and compressed in Python, so it 
> should not be serialized and compressed again in the JVM. Also, the JVM does 
> not need to know its content, so it could live outside the JVM.
> So we should have a dedicated broadcast implementation for Python that stores 
> the serialized and compressed data on disk, transfers it to executors in a 
> p2p way (similar to TorrentBroadcast), and sends it to the Python workers.
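As a rough illustration of this idea (not Spark's actual implementation; the class and method names below are made up), the broadcast value could be pickled and compressed once in Python, kept only as a file on disk, and loaded lazily by each Python worker:

{code}
import os, pickle, tempfile, zlib

class FileBackedBroadcast(object):
    """Hypothetical sketch: serialize + compress once, keep only a file path."""

    def __init__(self, value):
        fd, self.path = tempfile.mkstemp(suffix=".broadcast")
        with os.fdopen(fd, "wb") as f:
            f.write(zlib.compress(pickle.dumps(value, 2)))
        self._value = None

    @property
    def value(self):
        if self._value is None:          # loaded lazily, e.g. in a worker
            with open(self.path, "rb") as f:
                self._value = pickle.loads(zlib.decompress(f.read()))
        return self._value
{code}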






[jira] [Commented] (SPARK-4518) Filestream sometimes processes files twice

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1473#comment-1473
 ] 

Apache Spark commented on SPARK-4518:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3419

> Filestream sometimes processes files twice
> --
>
> Key: SPARK-4518
> URL: https://issues.apache.org/jira/browse/SPARK-4518
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>







[jira] [Commented] (SPARK-4519) Filestream does not use hadoop configuration set within sparkContext.hadoopConfiguration

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1474#comment-1474
 ] 

Apache Spark commented on SPARK-4519:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/3419

> Filestream does not use hadoop configuration set within 
> sparkContext.hadoopConfiguration
> 
>
> Key: SPARK-4519
> URL: https://issues.apache.org/jira/browse/SPARK-4519
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.0.2, 1.1.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>







[jira] [Commented] (SPARK-4559) Adding support for ucase and lcase

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1471#comment-1471
 ] 

Apache Spark commented on SPARK-4559:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3418

> Adding support for ucase and lcase
> --
>
> Key: SPARK-4559
> URL: https://issues.apache.org/jira/browse/SPARK-4559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> Adding support for ucase and lcase in Spark SQL






[jira] [Created] (SPARK-4559) Adding support for ucase and lcase

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4559:
--

 Summary: Adding support for ucase and lcase
 Key: SPARK-4559
 URL: https://issues.apache.org/jira/browse/SPARK-4559
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


Adding support for ucase and lcase in Spark SQL






[jira] [Commented] (SPARK-4489) JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException

2014-11-22 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1462#comment-1462
 ] 

Josh Rosen commented on SPARK-4489:
---

It looks like this is still a legitimate issue; the underlying bug is due to 
the Java API's handling of ClassTags plus incomplete test coverage for the Java 
API.  Regarding the ClassTag workaround in the gist, I think that you might be 
able to use the {{retag()}} method that I added in the fix to SPARK-1040 to 
quickly fix this.  I may be able to take a look at this reproduction later, but 
I'm going to leave this unassigned for now since it would be a great starter 
task for someone to pick up.

> JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException
> -
>
> Key: SPARK-4489
> URL: https://issues.apache.org/jira/browse/SPARK-4489
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.1.0
>Reporter: Christopher Ng
>
> Calling collectAsMap() on a JavaPairRDD reconstructed from a checkpoint fails 
> with a ClassCastException:
> Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; 
> cannot be cast to [Lscala.Tuple2;
>   at 
> org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:595)
>   at 
> org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:569)
>   at org.facboy.spark.CheckpointBug.main(CheckpointBug.java:46)
> Code sample reproducing the issue: 
> https://gist.github.com/facboy/8387e950ffb0746a8272






[jira] [Updated] (SPARK-4530) GradientDescent get a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.

2014-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4530:
-
Priority: Major  (was: Blocker)

See the comments on the PR. I don't think these things rise to the level of 
'blocker'.

> GradientDescent get a wrong gradient value according to the gradient formula, 
> which is caused by the miniBatchSize parameter.
> -
>
> Key: SPARK-4530
> URL: https://issues.apache.org/jira/browse/SPARK-4530
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Guoqiang Li
>
> This bug is caused by {{RDD.sample}}: the number of items {{RDD.sample}} 
> returns is not fixed.
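For context, a tiny PySpark snippet (illustrative only; the RDD contents and the printed counts are made up) showing why the sampled size is not fixed: {{sample()}} draws each element independently, so the count only varies around fraction * size.

{code}
from pyspark import SparkContext

sc = SparkContext(appName="sample-size-varies")
rdd = sc.parallelize(range(10000))
# Bernoulli sampling: each run returns roughly, but not exactly, 1000 items.
print([rdd.sample(False, 0.1, seed).count() for seed in (1, 2, 3)])
# e.g. [1003, 987, 1024]
{code}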






[jira] [Created] (SPARK-4558) History Server waits ~10s before starting up

2014-11-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-4558:


 Summary: History Server waits ~10s before starting up
 Key: SPARK-4558
 URL: https://issues.apache.org/jira/browse/SPARK-4558
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Minor


After you call `sbin/start-history-server.sh`, it waits about 10s before 
actually starting up. I suspect this is a subtle bug related to log checking.






[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell

2014-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1430#comment-1430
 ] 

Sean Owen commented on SPARK-4490:
--

commons-math3 is still a dependency of core, yes. Are you saying this works 
with HEAD? That would make more sense, but in general I think you would still 
want to explicitly add breeze and commons-math3 to the classpath if you want to 
use them in spark-shell, rather than rely on them being in the assembly.

> Not found RandomGenerator through spark-shell
> -
>
> Key: SPARK-4490
> URL: https://issues.apache.org/jira/browse/SPARK-4490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: spark-shell
>Reporter: Kai Sasaki
>
> In Spark 1.1.0, an exception is thrown whenever RandomGenerator from 
> commons-math3 is used. There is a workaround for this problem:
> http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3
> ```
> scala> import breeze.linalg._
> import breeze.linalg._
> scala> Matrix.rand[Double](3, 3)
> java.lang.NoClassDefFoundError: 
> org/apache/commons/math3/random/RandomGenerator
> at 
> breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205)
> at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:14)
> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:19)
> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:21)
> at $iwC$$iwC$$iwC.<init>(<console>:23)
> at $iwC$$iwC.<init>(<console>:25)
> at $iwC.<init>(<console>:27)
> at <init>(<console>:29)
> at .<init>(<console>:33)
> at .<clinit>()
> at .<init>(<console>:7)
> at .<clinit>()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
> at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
> at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
> at 
> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
> at 
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
> at 
> org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
> at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.commons.math3.random.RandomGenerator
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 44 more
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---

[jira] [Updated] (SPARK-4557) Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2014-11-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4557:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

(Don't think this is a bug, really.) Yes, it's possible VoidFunction didn't 
exist when this API was defined. It can't be changed now without breaking API 
compatibility but AFAICT VoidFunction would be more appropriate. Maybe this can 
happen with some other related Java API rationalization in Spark 2.x.
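
In the meantime, callers can recover the terse form with a tiny adapter on their side; this is a hedged, user-level sketch (ForeachAdapter is not part of Spark's API):

{code:java}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaDStream;

// Hypothetical helper: accepts a VoidFunction-style lambda and bridges it to
// the existing foreachRDD(Function<JavaRDD<T>, Void>) signature.
public final class ForeachAdapter {
  private ForeachAdapter() {}

  public static <T> void foreachRDD(JavaDStream<T> stream,
                                    final VoidFunction<JavaRDD<T>> f) {
    stream.foreachRDD(new Function<JavaRDD<T>, Void>() {
      @Override
      public Void call(JavaRDD<T> rdd) throws Exception {
        f.call(rdd);
        return null;  // satisfy the Function<..., Void> contract in one place
      }
    });
  }
}
{code}

With that, a call site reads {{ForeachAdapter.foreachRDD(stream, rdd -> rdd.foreach(x -> System.out.println(x)));}} with no {{return null;}} noise.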

> Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a 
> Function<..., Void>
> ---
>
> Key: SPARK-4557
> URL: https://issues.apache.org/jira/browse/SPARK-4557
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.1.0
>Reporter: Alexis Seigneurin
>Priority: Minor
>
> In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
> have to write:
> {code:java}
> .foreachRDD(items -> {
> ...;
> return null;
> });
> {code}
> Instead of:
> {code:java}
> .foreachRDD(items -> ...);
> {code}
> This is because the foreachRDD method accepts a Function<JavaRDD<T>, Void> 
> instead of a VoidFunction<JavaRDD<T>>. It would make sense to change it to a 
> VoidFunction since, in Spark's API, the foreach method already accepts a 
> VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1415#comment-1415
 ] 

Patrick Wendell edited comment on SPARK-4556 at 11/22/14 10:17 PM:
---

Check out make-distribution.sh rather than using maven directly. We might 
consider removing that maven target since I don't think it's actively 
maintained. We should document clearly that make-distribution.sh is the way to 
build binaries.


was (Author: pwendell):
Check out make-distribution.sh rather than using maven directly. We might 
consider removing that maven target since I don't think it's actively 
maintained.

> binary distribution assembly can't run in local mode
> 
>
> Key: SPARK-4556
> URL: https://issues.apache.org/jira/browse/SPARK-4556
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Reporter: Sean Busbey
>
> After building the binary distribution assembly, the resultant tarball can't 
> be used for local mode.
> {code}
> busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
> [INFO] Scanning for projects...
> ...SNIP...
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 31.402 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
> s]
> [INFO] Spark Project Core . SUCCESS [15:39 
> min]
> [INFO] Spark Project Bagel  SUCCESS [ 29.470 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [05:20 
> min]
> [INFO] Spark Project Streaming  SUCCESS [11:02 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [11:26 
> min]
> [INFO] Spark Project SQL .. SUCCESS [11:33 
> min]
> [INFO] Spark Project ML Library ... SUCCESS [14:27 
> min]
> [INFO] Spark Project Tools  SUCCESS [ 40.980 
> s]
> [INFO] Spark Project Hive . SUCCESS [11:45 
> min]
> [INFO] Spark Project REPL . SUCCESS [03:15 
> min]
> [INFO] Spark Project Assembly . SUCCESS [04:22 
> min]
> [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
> s]
> [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
> s]
> [INFO] Spark Project External Flume ... SUCCESS [01:41 
> min]
> [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
> s]
> [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
> s]
> [INFO] Spark Project External Kafka ... SUCCESS [01:23 
> min]
> [INFO] Spark Project Examples . SUCCESS [10:19 
> min]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 01:47 h
> [INFO] Finished at: 2014-11-22T02:13:51-06:00
> [INFO] Final Memory: 79M/2759M
> [INFO] 
> 
> busbey2-MBA:spark busbey$ cd assembly/target/
> busbey2-MBA:target busbey$ mkdir dist-temp
> busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
> spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
> busbey2-MBA:target busbey$ cd dist-temp/
> busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
> ls: 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
>  No such file or directory
> Failed to find Spark assembly in 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
> You need to build Spark before running this program.
> {code}
> It looks like the classpath calculations in {{bin/compute_classpath.sh}} 
> don't handle it.
> If I move all of the spark-*.jar files from the top level into the lib folder 
> and touch the RELEASE file, then the spark shell launches in local mode 
> normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1415#comment-1415
 ] 

Patrick Wendell commented on SPARK-4556:


Check out make-distribution.sh rather than using maven directly. We might 
consider removing that maven target since I don't think it's actively 
maintained.

> binary distribution assembly can't run in local mode
> 
>
> Key: SPARK-4556
> URL: https://issues.apache.org/jira/browse/SPARK-4556
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Reporter: Sean Busbey
>
> After building the binary distribution assembly, the resultant tarball can't 
> be used for local mode.
> {code}
> busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
> [INFO] Scanning for projects...
> ...SNIP...
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 31.402 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
> s]
> [INFO] Spark Project Core . SUCCESS [15:39 
> min]
> [INFO] Spark Project Bagel  SUCCESS [ 29.470 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [05:20 
> min]
> [INFO] Spark Project Streaming  SUCCESS [11:02 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [11:26 
> min]
> [INFO] Spark Project SQL .. SUCCESS [11:33 
> min]
> [INFO] Spark Project ML Library ... SUCCESS [14:27 
> min]
> [INFO] Spark Project Tools  SUCCESS [ 40.980 
> s]
> [INFO] Spark Project Hive . SUCCESS [11:45 
> min]
> [INFO] Spark Project REPL . SUCCESS [03:15 
> min]
> [INFO] Spark Project Assembly . SUCCESS [04:22 
> min]
> [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
> s]
> [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
> s]
> [INFO] Spark Project External Flume ... SUCCESS [01:41 
> min]
> [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
> s]
> [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
> s]
> [INFO] Spark Project External Kafka ... SUCCESS [01:23 
> min]
> [INFO] Spark Project Examples . SUCCESS [10:19 
> min]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 01:47 h
> [INFO] Finished at: 2014-11-22T02:13:51-06:00
> [INFO] Final Memory: 79M/2759M
> [INFO] 
> 
> busbey2-MBA:spark busbey$ cd assembly/target/
> busbey2-MBA:target busbey$ mkdir dist-temp
> busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
> spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
> busbey2-MBA:target busbey$ cd dist-temp/
> busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
> ls: 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
>  No such file or directory
> Failed to find Spark assembly in 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
> You need to build Spark before running this program.
> {code}
> It looks like the classpath calculations in {{bin/compute_classpath.sh}} 
> don't handle it.
> If I move all of the spark-*.jar files from the top level into the lib folder 
> and touch the RELEASE file, then the spark shell launches in local mode 
> normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1410#comment-1410
 ] 

Sean Busbey commented on SPARK-4556:


Well, why does the layout of the binary distribution differ from the layout in 
a release?

At a minimum the README should be updated to clarify the purpose of the binary 
distribution. Preferably, the README should include instructions for taking the 
binary distribution and deploying it to be runnable.

> binary distribution assembly can't run in local mode
> 
>
> Key: SPARK-4556
> URL: https://issues.apache.org/jira/browse/SPARK-4556
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Reporter: Sean Busbey
>
> After building the binary distribution assembly, the resultant tarball can't 
> be used for local mode.
> {code}
> busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
> [INFO] Scanning for projects...
> ...SNIP...
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 31.402 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
> s]
> [INFO] Spark Project Core . SUCCESS [15:39 
> min]
> [INFO] Spark Project Bagel  SUCCESS [ 29.470 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [05:20 
> min]
> [INFO] Spark Project Streaming  SUCCESS [11:02 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [11:26 
> min]
> [INFO] Spark Project SQL .. SUCCESS [11:33 
> min]
> [INFO] Spark Project ML Library ... SUCCESS [14:27 
> min]
> [INFO] Spark Project Tools  SUCCESS [ 40.980 
> s]
> [INFO] Spark Project Hive . SUCCESS [11:45 
> min]
> [INFO] Spark Project REPL . SUCCESS [03:15 
> min]
> [INFO] Spark Project Assembly . SUCCESS [04:22 
> min]
> [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
> s]
> [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
> s]
> [INFO] Spark Project External Flume ... SUCCESS [01:41 
> min]
> [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
> s]
> [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
> s]
> [INFO] Spark Project External Kafka ... SUCCESS [01:23 
> min]
> [INFO] Spark Project Examples . SUCCESS [10:19 
> min]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 01:47 h
> [INFO] Finished at: 2014-11-22T02:13:51-06:00
> [INFO] Final Memory: 79M/2759M
> [INFO] 
> 
> busbey2-MBA:spark busbey$ cd assembly/target/
> busbey2-MBA:target busbey$ mkdir dist-temp
> busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
> spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
> busbey2-MBA:target busbey$ cd dist-temp/
> busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
> ls: 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
>  No such file or directory
> Failed to find Spark assembly in 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
> You need to build Spark before running this program.
> {code}
> It looks like the classpath calculations in {{bin/compute_classpath.sh}} 
> don't handle it.
> If I move all of the spark-*.jar files from the top level into the lib folder 
> and touch the RELEASE file, then the spark shell launches in local mode 
> normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1406#comment-1406
 ] 

Sean Owen commented on SPARK-4556:
--

Hm, but is that a bug? I think compute-classpath.sh is designed to support 
running from the project root in development, or running from the files as laid 
out in the release, at least judging from your comments and the script itself. 
I don't think the raw contents of the assembly JAR themselves are a runnable 
installation.

> binary distribution assembly can't run in local mode
> 
>
> Key: SPARK-4556
> URL: https://issues.apache.org/jira/browse/SPARK-4556
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Reporter: Sean Busbey
>
> After building the binary distribution assembly, the resultant tarball can't 
> be used for local mode.
> {code}
> busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
> [INFO] Scanning for projects...
> ...SNIP...
> [INFO] 
> 
> [INFO] Reactor Summary:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 
> s]
> [INFO] Spark Project Networking ... SUCCESS [ 31.402 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 
> s]
> [INFO] Spark Project Core . SUCCESS [15:39 
> min]
> [INFO] Spark Project Bagel  SUCCESS [ 29.470 
> s]
> [INFO] Spark Project GraphX ... SUCCESS [05:20 
> min]
> [INFO] Spark Project Streaming  SUCCESS [11:02 
> min]
> [INFO] Spark Project Catalyst . SUCCESS [11:26 
> min]
> [INFO] Spark Project SQL .. SUCCESS [11:33 
> min]
> [INFO] Spark Project ML Library ... SUCCESS [14:27 
> min]
> [INFO] Spark Project Tools  SUCCESS [ 40.980 
> s]
> [INFO] Spark Project Hive . SUCCESS [11:45 
> min]
> [INFO] Spark Project REPL . SUCCESS [03:15 
> min]
> [INFO] Spark Project Assembly . SUCCESS [04:22 
> min]
> [INFO] Spark Project External Twitter . SUCCESS [ 43.567 
> s]
> [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 
> s]
> [INFO] Spark Project External Flume ... SUCCESS [01:41 
> min]
> [INFO] Spark Project External MQTT  SUCCESS [ 40.973 
> s]
> [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 
> s]
> [INFO] Spark Project External Kafka ... SUCCESS [01:23 
> min]
> [INFO] Spark Project Examples . SUCCESS [10:19 
> min]
> [INFO] 
> 
> [INFO] BUILD SUCCESS
> [INFO] 
> 
> [INFO] Total time: 01:47 h
> [INFO] Finished at: 2014-11-22T02:13:51-06:00
> [INFO] Final Memory: 79M/2759M
> [INFO] 
> 
> busbey2-MBA:spark busbey$ cd assembly/target/
> busbey2-MBA:target busbey$ mkdir dist-temp
> busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
> spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
> busbey2-MBA:target busbey$ cd dist-temp/
> busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
> ls: 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
>  No such file or directory
> Failed to find Spark assembly in 
> /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
> You need to build Spark before running this program.
> {code}
> It looks like the classpath calculations in {{bin/compute_classpath.sh}} 
> don't handle it.
> If I move all of the spark-*.jar files from the top level into the lib folder 
> and touch the RELEASE file, then the spark shell launches in local mode 
> normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4377:
---
Fix Version/s: 1.3.0

> ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to 
> deserialize a serialized ActorRef without an ActorSystem in scope.
> -
>
> Key: SPARK-4377
> URL: https://issues.apache.org/jira/browse/SPARK-4377
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Assignee: Prashant Sharma
>Priority: Blocker
> Fix For: 1.3.0
>
>
> It looks like ZooKeeperPersistenceEngine is broken in the current Spark 
> master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481).  Here's a log excerpt from 
> a secondary master when it takes over from a failed primary master:
> {code}
> 14/11/13 04:37:12 WARN ConnectionStateManager: There are no 
> ConnectionStateListeners registered.
> 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 
> cores, 984.0 MB RAM
> 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
> 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading 
> persisted file, deleting
> java.io.IOException: java.lang.IllegalStateException: Trying to deserialize a 
> serialized ActorRef without an ActorSystem in scope. Use 
> 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }'
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988)
>   at 
> org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32)
>   at 
> org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84)
>   at 
> org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine

[jira] [Created] (SPARK-4557) Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>

2014-11-22 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-4557:


 Summary: Spark Streaming's foreachRDD method should accept a 
VoidFunction<...>, not a Function<..., Void>
 Key: SPARK-4557
 URL: https://issues.apache.org/jira/browse/SPARK-4557
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Alexis Seigneurin


In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You 
have to write:

{code:java}
.foreachRDD(items -> {
...;
return null;
});
{code}

Instead of:

{code:java}
.foreachRDD(items -> ...);
{code}

This is because the foreachRDD method accepts a Function<JavaRDD<T>, Void> 
instead of a VoidFunction<JavaRDD<T>>. It would make sense to change it to a 
VoidFunction since, in Spark's API, the foreach method already accepts a 
VoidFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Busbey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Busbey updated SPARK-4556:
---
Description: 
After building the binary distribution assembly, the resultant tarball can't be 
used for local mode.

{code}
busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
[INFO] Scanning for projects...
...SNIP...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s]
[INFO] Spark Project Networking ... SUCCESS [ 31.402 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 s]
[INFO] Spark Project Core . SUCCESS [15:39 min]
[INFO] Spark Project Bagel  SUCCESS [ 29.470 s]
[INFO] Spark Project GraphX ... SUCCESS [05:20 min]
[INFO] Spark Project Streaming  SUCCESS [11:02 min]
[INFO] Spark Project Catalyst . SUCCESS [11:26 min]
[INFO] Spark Project SQL .. SUCCESS [11:33 min]
[INFO] Spark Project ML Library ... SUCCESS [14:27 min]
[INFO] Spark Project Tools  SUCCESS [ 40.980 s]
[INFO] Spark Project Hive . SUCCESS [11:45 min]
[INFO] Spark Project REPL . SUCCESS [03:15 min]
[INFO] Spark Project Assembly . SUCCESS [04:22 min]
[INFO] Spark Project External Twitter . SUCCESS [ 43.567 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s]
[INFO] Spark Project External Flume ... SUCCESS [01:41 min]
[INFO] Spark Project External MQTT  SUCCESS [ 40.973 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s]
[INFO] Spark Project External Kafka ... SUCCESS [01:23 min]
[INFO] Spark Project Examples . SUCCESS [10:19 min]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 01:47 h
[INFO] Finished at: 2014-11-22T02:13:51-06:00
[INFO] Final Memory: 79M/2759M
[INFO] 
busbey2-MBA:spark busbey$ cd assembly/target/
busbey2-MBA:target busbey$ mkdir dist-temp
busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
busbey2-MBA:target busbey$ cd dist-temp/
busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
ls: 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
 No such file or directory
Failed to find Spark assembly in 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
You need to build Spark before running this program.
{code}

It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't 
handle it.

If I move all of the spark-*.jar files from the top level into the lib folder 
and touch the RELEASE file, then the spark shell launches in local mode 
normally.

  was:
After building the binary distribution assembly, the resultant tarball can't be 
used for local mode.

{code}
busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
[INFO] Scanning for projects...
...SNIP...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s]
[INFO] Spark Project Networking ... SUCCESS [ 31.402 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 s]
[INFO] Spark Project Core . SUCCESS [15:39 min]
[INFO] Spark Project Bagel  SUCCESS [ 29.470 s]
[INFO] Spark Project GraphX ... SUCCESS [05:20 min]
[INFO] Spark Project Streaming  SUCCESS [11:02 min]
[INFO] Spark Project Catalyst . SUCCESS [11:26 min]
[INFO] Spark Project SQL .. SUCCESS [11:33 min]
[INFO] Spark Project ML Library ... SUCCESS [14:27 min]
[INFO] Spark Project Tools  SUCCESS [ 40.980 s]
[INFO] Spark Project Hive . SUCCESS [11:45 min]
[INFO] Spark Project REPL . SUCCESS [03:15 min]
[INFO] Spark Project Assembly . SUCCESS [04:22 min]
[INFO] Spark Project External Twitter . SUCCESS [ 43.567 s]
[INFO] Spark Project External Flume Sink .

[jira] [Created] (SPARK-4556) binary distribution assembly can't run in local mode

2014-11-22 Thread Sean Busbey (JIRA)
Sean Busbey created SPARK-4556:
--

 Summary: binary distribution assembly can't run in local mode
 Key: SPARK-4556
 URL: https://issues.apache.org/jira/browse/SPARK-4556
 Project: Spark
  Issue Type: Bug
  Components: Build, Spark Shell
Reporter: Sean Busbey


After building the binary distribution assembly, the resultant tarball can't be 
used for local mode.

{code}
busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package
[INFO] Scanning for projects...
...SNIP...
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s]
[INFO] Spark Project Networking ... SUCCESS [ 31.402 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  8.864 s]
[INFO] Spark Project Core . SUCCESS [15:39 min]
[INFO] Spark Project Bagel  SUCCESS [ 29.470 s]
[INFO] Spark Project GraphX ... SUCCESS [05:20 min]
[INFO] Spark Project Streaming  SUCCESS [11:02 min]
[INFO] Spark Project Catalyst . SUCCESS [11:26 min]
[INFO] Spark Project SQL .. SUCCESS [11:33 min]
[INFO] Spark Project ML Library ... SUCCESS [14:27 min]
[INFO] Spark Project Tools  SUCCESS [ 40.980 s]
[INFO] Spark Project Hive . SUCCESS [11:45 min]
[INFO] Spark Project REPL . SUCCESS [03:15 min]
[INFO] Spark Project Assembly . SUCCESS [04:22 min]
[INFO] Spark Project External Twitter . SUCCESS [ 43.567 s]
[INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s]
[INFO] Spark Project External Flume ... SUCCESS [01:41 min]
[INFO] Spark Project External MQTT  SUCCESS [ 40.973 s]
[INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s]
[INFO] Spark Project External Kafka ... SUCCESS [01:23 min]
[INFO] Spark Project Examples . SUCCESS [10:19 min]
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 01:47 h
[INFO] Finished at: 2014-11-22T02:13:51-06:00
[INFO] Final Memory: 79M/2759M
[INFO] 
busbey2-MBA:spark busbey$ cd assembly/target/
busbey2-MBA:target busbey$ mkdir dist-temp
busbey2-MBA:target busbey$ tar -C dist-temp -xzf 
spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz 
busbey2-MBA:target busbey$ cd dist-temp/
busbey2-MBA:dist-temp busbey$ ./bin/spark-shell
ls: 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10:
 No such file or directory
Failed to find Spark assembly in 
/Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10
You need to build Spark before running this program.
{code}

It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't 
handle it.

If I move all of the spark-*.jar files from the top level into the lib folder 
and touch the RELEASE file, then the spark shell launches in local mode 
normally.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4507) PR merge script should support closing multiple JIRA tickets

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4507:
---
Labels: starter  (was: )

> PR merge script should support closing multiple JIRA tickets
> 
>
> Key: SPARK-4507
> URL: https://issues.apache.org/jira/browse/SPARK-4507
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Josh Rosen
>Priority: Minor
>  Labels: starter
>
> For pull requests that reference multiple JIRAs in their titles, it would be 
> helpful if the PR merge script offered to close all of them.
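
The parsing side of this is small: the merge script would just need to collect every JIRA id appearing in the title instead of only the first. The script itself is Python (under dev/), so this Java sketch is purely an illustration of that step (names are made up):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: pull every JIRA id out of a PR title such as
// "[SPARK-4507][SPARK-1234] Support closing multiple tickets".
public class JiraIdsFromTitle {
  private static final Pattern JIRA_ID = Pattern.compile("SPARK-\\d+");

  public static List<String> extract(String prTitle) {
    List<String> ids = new ArrayList<>();
    Matcher m = JIRA_ID.matcher(prTitle);
    while (m.find()) {
      ids.add(m.group());
    }
    return ids;  // the merge script would then offer to close each of these
  }

  public static void main(String[] args) {
    System.out.println(extract("[SPARK-4507][SPARK-1234] Support closing multiple tickets"));
  }
}
{code}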



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Priority: Critical  (was: Major)

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>Priority: Critical
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that jenkins can publish this 
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in jenkins. There may be simpler solutions as well.
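
As one entirely hypothetical shape for that idea: the encrypted blob lives at a publicly visible URL, and the Jenkins job decrypts it with a passphrase taken from an environment variable. A sketch of the decryption side, assuming (for illustration only) a PBKDF2-derived AES-CBC key and a blob laid out as 16-byte salt, 16-byte IV, then ciphertext:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

// Illustrative only: the blob layout, env var name, and key parameters are
// assumptions made for this sketch, not an agreed-upon format.
public final class CredentialDecryptor {
  public static byte[] decrypt(String downloadedBlobPath)
      throws IOException, GeneralSecurityException {
    byte[] blob = Files.readAllBytes(Paths.get(downloadedBlobPath));
    byte[] salt = Arrays.copyOfRange(blob, 0, 16);
    byte[] iv = Arrays.copyOfRange(blob, 16, 32);
    byte[] body = Arrays.copyOfRange(blob, 32, blob.length);

    char[] passphrase = System.getenv("CREDENTIALS_PASSPHRASE").toCharArray();
    SecretKeyFactory kf = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1");
    byte[] key = kf.generateSecret(
        new PBEKeySpec(passphrase, salt, 10000, 128)).getEncoded();

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
        new IvParameterSpec(iv));
    return cipher.doFinal(body);  // the decrypted publishing credentials
  }
}
{code}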



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Target Version/s: 1.3.0

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that jenkins can publish this 
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4542) Post nightly releases

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4542.

Resolution: Duplicate

> Post nightly releases
> -
>
> Key: SPARK-4542
> URL: https://issues.apache.org/jira/browse/SPARK-4542
> Project: Spark
>  Issue Type: Improvement
>Reporter: Arun Ahuja
>
> Spark developers are continually including new improvements and fixes to 
> sometimes critical issues. To speed up review and resolve the issues faster, 
> it will be faster for multiple people to test (or use those fixes if they are 
> critical) if there are 1) snapshots published to maven and 2) full 
> distributions/scripts posted somewhere. Otherwise each individual 
> developer has to pull and rebuild, which may be a very long process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-1517:
---
Fix Version/s: (was: 1.2.0)

> Publish nightly snapshots of documentation, maven artifacts, and binary builds
> --
>
> Key: SPARK-1517
> URL: https://issues.apache.org/jira/browse/SPARK-1517
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Patrick Wendell
>
> Should be pretty easy to do with Jenkins. The only thing I can think of that 
> would be tricky is to set up credentials so that jenkins can publish this 
> stuff somewhere on apache infra.
> Ideally we don't want to have to put a private key on every jenkins box 
> (since they are otherwise pretty stateless). One idea is to encrypt these 
> credentials with a passphrase and post them somewhere publicly visible. Then 
> the jenkins build can download the credentials provided we set a passphrase 
> in an environment variable in jenkins. There may be simpler solutions as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2143) Display Spark version on Driver web page

2014-11-22 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2143:
---
Priority: Critical  (was: Major)

> Display Spark version on Driver web page
> 
>
> Key: SPARK-2143
> URL: https://issues.apache.org/jira/browse/SPARK-2143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Jeff Hammerbacher
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS

2014-11-22 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222172#comment-14222172
 ] 

Patrick Wendell commented on SPARK-4516:


Okay then I think this is just a documentation issue. We should add the 
documentation about direct buffers to the main configuration page and also 
mention it in the doc about network options.
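
Until that documentation lands, the workaround mentioned in the ticket description can also be set from application code; a hedged sketch, assuming the 1.2-era setting {{spark.shuffle.blockTransferService}} is the intended knob:

{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Hedged sketch: fall back to the nio block transfer service (the mitigation
// described in this ticket) instead of the netty-based one.
public class NioTransferFallback {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("nio-fallback")
        .set("spark.shuffle.blockTransferService", "nio");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // ... job code ...
    sc.stop();
  }
}
{code}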

> Netty off-heap memory use causes executors to be killed by OS
> -
>
> Key: SPARK-4516
> URL: https://issues.apache.org/jira/browse/SPARK-4516
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
> Environment: Linux, Mesos
>Reporter: Hector Yee
>Priority: Critical
>  Labels: netty, shuffle
>
> The netty block transfer manager has a race condition where it closes an 
> active connection resulting in the error below. Switching to nio seems to 
> alleviate the problem.
> {code}
> 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to 
> i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it.
> 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch 
> of 1 outstanding blocks 
> java.io.IOException: Failed to connect to 
> i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141)
> at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78)
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at 
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246)
> at 
> com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: 
> i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

--

[jira] [Commented] (SPARK-4548) Python broadcast is very slow

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222154#comment-14222154
 ] 

Apache Spark commented on SPARK-4548:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/3417

> Python broadcast is very slow
> -
>
> Key: SPARK-4548
> URL: https://issues.apache.org/jira/browse/SPARK-4548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>
> Python broadcast in 1.2 is much slower than 1.1. 
> In spark-perf tests:
> || name || 1.1 || 1.2 || speedup ||
> | python-broadcast-w-set | 3.63 | 16.68 | -78.23% |



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4555) Add forward compatibility tests to JsonProtocol

2014-11-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-4555:
-

 Summary: Add forward compatibility tests to JsonProtocol
 Key: SPARK-4555
 URL: https://issues.apache.org/jira/browse/SPARK-4555
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Josh Rosen


The web UI / event listener's JsonProtocol is designed to be backwards- and 
forwards-compatible: newer versions of Spark should be able to consume event 
logs written by older versions and vice-versa.

We currently have backwards-compatibility tests for the "newer version reads 
older log" case; this JIRA tracks progress for adding the opposite 
forwards-compatibility tests.

This type of test could be non-trivial to write, since I think we'd need to 
actually run a script against multiple compiled Spark releases, so this test 
might need to sit outside of Spark Core itself as part of an integration 
testing suite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222112#comment-14222112
 ] 

Evan Sparks commented on SPARK-1405:


Bucket has been created: 
s3://files.sparks.requester.pays/enwiki_category_text/ - all in all there are 
181 files of ~50MB each (about 10GB in total). 

It probably makes sense to use http://sweble.org/ or something to strip the 
boilerplate, etc. from the documents for the purposes of topic modeling.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.
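
For context on the "Gibbs sampling core": the per-token update at the heart of a collapsed Gibbs sampler for LDA is the standard conditional below (textbook LDA, not taken from this PR). Counts with superscript -i exclude the current token, V is the vocabulary size, and alpha and beta are the Dirichlet priors:

{code}
P(z_i = k | z_{-i}, w) \propto \frac{n_{k, w_i}^{-i} + \beta}{n_{k,\cdot}^{-i} + V\beta} \left( n_{d_i, k}^{-i} + \alpha \right)
{code}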



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108
 ] 

Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM:
---

[~sparks] that will be awesome...I should be fine running experiments on EC2...


was (Author: debasish83):
@sparks that will be awesome...I should be fine running experiments on EC2...

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108
 ] 

Debasish Das commented on SPARK-1405:
-

@sparks that will be awesome...I should be fine running experiments on EC2...

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Evan Sparks (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222105#comment-14222105
 ] 

Evan Sparks commented on SPARK-1405:


[~gq] - Those are great numbers for a very high number of topics - it's a 
little tough to follow what's leading to the super-linear scaling in #topics in 
your code, though. Are you using FastLDA or something similar to speed up 
sampling? (http://www.ics.uci.edu/~newman/pubs/fastlda.pdf)

Pedro has been testing on a wikipedia dump on s3 which I provided. It's XML 
formatted, one document per line, so it's easy to parse. I will copy this to a 
requester-pays bucket (which will be free if you run your experiments on ec2) 
now so that everyone working on this can use it for testing.

NIPS dataset seems fine for small-scale testing, but I think it's important 
that we test this implementation across a range of values for documents, words, 
topics, and tokens - hence, I think the data generator that Pedro is working on 
is a really good idea (and follows the convention of the existing data 
generators in MLlib). We'll have to be a little careful here, because some of 
the methods for making LDA fast rely on the fact that it tends to converge 
fast, and I expect that data generated by the model will be much easier to fit 
than real data.

Also, can we try and be consistent in our terminology - getting the # of unique 
words confused with all the words in a corpus is easy. I propose "words" and 
"tokens" for these two things.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222089#comment-14222089
 ] 

Debasish Das commented on SPARK-1405:
-

The NIPS dataset is common for PLSA and additive-regularization-based matrix 
factorization formulations as well, since the experiments in this paper also 
focused on it: 
http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

I will be using the NIPS dataset for quality experiments, but for scaling 
experiments the wiki data is good. The wiki data was demoed by Databricks at the 
last Spark Summit; it would be great if we can get it from that demo.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222048#comment-14222048
 ] 

Guoqiang Li commented on SPARK-1405:


Sorry, I mean the wikipedia data download URL. How much text do we need? I 
think one billion words is appropriate.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already solved), a word segmentation step (imported from 
> Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222036#comment-14222036
 ] 

Pedro Rodriguez commented on SPARK-1405:


I'm not sure which download URL you are referring to.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmentation step (imported 
> from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222033#comment-14222033
 ] 

Guoqiang Li commented on SPARK-1405:


OK, where is the download URL?

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmentation step (imported 
> from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Pedro Rodriguez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222030#comment-14222030
 ] 

Pedro Rodriguez commented on SPARK-1405:


I don't know of a larger data set, but I am working on an LDA data set 
generator based on the generative model. It should be good for benchmark 
testing while still being reasonable from the ML perspective.

The metric is in the LDA code (it is toggled with a flag on the LDA model). 
You can find it in the logLikelihood function here:
https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/LDA.scala
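
For readers who don't want to dig through the linked file, the usual token-level 
log-likelihood such a metric computes can be sketched as below. The names and 
signature are assumptions for illustration, not the ones in LDA.scala: theta(d)(k) 
is P(topic k | doc d) and phi(k)(w) is P(word w | topic k).

{code:scala}
// Sketch of a corpus log-likelihood metric for a fitted LDA model.
def corpusLogLikelihood(
    docs: Seq[Array[Int]],        // each document as an array of word ids
    theta: Array[Array[Double]],  // per-document topic distributions
    phi: Array[Array[Double]]     // per-topic word distributions
): Double = {
  val numTopics = phi.length
  docs.zipWithIndex.map { case (words, d) =>
    words.map { w =>
      // marginal probability of this token under the fitted model
      var p = 0.0
      var k = 0
      while (k < numTopics) { p += theta(d)(k) * phi(k)(w); k += 1 }
      math.log(p)
    }.sum
  }.sum
}
{code}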

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmentation step (imported 
> from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222027#comment-14222027
 ] 

Debasish Das commented on SPARK-1405:
-

[~pedrorodriguez] did you write the metric in your repo as well? That way I 
don't have to code it up again.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmentation step (imported 
> from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024
 ] 

Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM:
---

We also need a larger dataset where topics go to the range of 1+. That range 
will stress factorization-based LSA formulations, since the factors are 
broadcast at each step. The NIPS dataset is small; let's start with that, but 
we should test a large dataset like Wikipedia as well. If there is a 
pre-processed version from either Mahout or scikit-learn, can we use that?


was (Author: debasish83):
We also need a larger dataset where topics go to the range of 1+. That range 
will stress factorization-based LSA formulations, since the factors are 
broadcast at each step. The NIPS dataset is small. Would you guys be willing 
to test a Wikipedia dataset, for example? If there is a pre-processed version 
from either Mahout or scikit-learn, can we use that?

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmentation step (imported 
> from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-11-22 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024
 ] 

Debasish Das commented on SPARK-1405:
-

We also need a larger dataset where topics go to the range of 1+. That range 
will stress factorization-based LSA formulations, since the factors are 
broadcast at each step. The NIPS dataset is small. Would you guys be willing 
to test a Wikipedia dataset, for example? If there is a pre-processed version 
from either Mahout or scikit-learn, can we use that? A rough sketch of what 
such preprocessing could look like follows below.
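
If no pre-processed dump turns up, here is a rough sketch (not the Databricks 
demo pipeline) of turning a directory of plain-text articles into 
(docPath, Array[wordId]) documents with Spark. The tokenization, minimum word 
length, and vocabulary cutoff are placeholders.

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Build a simple bag-of-words corpus keyed by file path.
def loadCorpus(sc: SparkContext, path: String, vocabSize: Int): RDD[(String, Array[Int])] = {
  val tokenized = sc.wholeTextFiles(path).map { case (file, text) =>
    (file, text.toLowerCase.split("[^a-z]+").filter(_.length > 3))
  }.cache()

  // keep only the vocabSize most frequent terms as the vocabulary
  val vocab: Map[String, Int] = tokenized
    .flatMap(_._2).map((_, 1L)).reduceByKey(_ + _)
    .top(vocabSize)(Ordering.by[(String, Long), Long](_._2))
    .map(_._1).zipWithIndex.toMap

  val bVocab = sc.broadcast(vocab)
  tokenized.map { case (file, words) =>
    (file, words.flatMap(w => bVocab.value.get(w)))
  }
}
{code}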

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Guoqiang Li
>Priority: Critical
>  Labels: features
> Attachments: performance_comparison.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> inference algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (already resolved), a word segmentation step (imported 
> from Lucene), and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221950#comment-14221950
 ] 

Apache Spark commented on SPARK-4554:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3416

> Set fair scheduler pool for JDBC client session in hive 13
> --
>
> Key: SPARK-4554
> URL: https://issues.apache.org/jira/browse/SPARK-4554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> Currently the Hive 13 shim does not support setting a fair scheduler pool.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4554:
--

 Summary: Set fair scheduler pool for JDBC client session in hive 13
 Key: SPARK-4554
 URL: https://issues.apache.org/jira/browse/SPARK-4554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


Currently the Hive 13 shim does not support setting a fair scheduler pool.
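
For context, a minimal sketch of what running a statement in a named fair 
scheduler pool looks like via SparkContext local properties; the helper and its 
names are assumptions for illustration, not the shim change in the PR.

{code:scala}
import org.apache.spark.sql.hive.HiveContext

// Run one statement inside a named fair-scheduler pool by setting the standard
// "spark.scheduler.pool" local property on the session's thread.
def runInPool(hive: HiveContext, pool: String, statement: String) = {
  hive.sparkContext.setLocalProperty("spark.scheduler.pool", pool)
  try {
    hive.sql(statement).collect()
  } finally {
    // reset so later statements on this thread fall back to the default pool
    hive.sparkContext.setLocalProperty("spark.scheduler.pool", null)
  }
}
{code}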



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib

2014-11-22 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221895#comment-14221895
 ] 

Kai Sasaki commented on SPARK-4288:
---

[~mengxr] Thank you. I'll join. 

> Add Sparse Autoencoder algorithm to MLlib 
> --
>
> Key: SPARK-4288
> URL: https://issues.apache.org/jira/browse/SPARK-4288
> Project: Spark
>  Issue Type: Wish
>  Components: MLlib
>Reporter: Guoqiang Li
>  Labels: features
>
> Are you proposing an implementation? Is it related to the neural network JIRA?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221892#comment-14221892
 ] 

Apache Spark commented on SPARK-4553:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3414

> query for parquet table with string fields in spark sql hive get binary result
> --
>
> Key: SPARK-4553
> URL: https://issues.apache.org/jira/browse/SPARK-4553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> run 
> create table test_parquet(key int, value string) stored as parquet;
> insert into table test_parquet select * from src;
> select * from test_parquet;
> get results like the following
> ...
> 282 [B@38fda3b
> 138 [B@1407a24
> 238 [B@12de6fb
> 419 [B@6c97695
> 15 [B@4885067
> 118 [B@156a8d3
> 72 [B@65d20dd
> 90 [B@4c18906
> 307 [B@60b24cc
> 19 [B@59cf51b
> 435 [B@39fdf37
> 10 [B@4f799d7
> 277 [B@3950951
> 273 [B@596bf4b
> 306 [B@3e91557
> 224 [B@3781d61
> 309 [B@2d0d128



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException

2014-11-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221889#comment-14221889
 ] 

Apache Spark commented on SPARK-4552:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/3413

> query for empty parquet table in spark sql hive get IllegalArgumentException
> 
>
> Key: SPARK-4552
> URL: https://issues.apache.org/jira/browse/SPARK-4552
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: wangfei
> Fix For: 1.2.0
>
>
> run
> create table test_parquet(key int, value string) stored as parquet;
> select * from test_parquet;
> get an error like the following
> java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
> file:/user/hive/warehouse/test_parquet
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4553:
--

 Summary: query for parquet table with string fields in spark sql 
hive get binary result
 Key: SPARK-4553
 URL: https://issues.apache.org/jira/browse/SPARK-4553
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


run 
create table test_parquet(key int, value string) stored as parquet;
insert into table test_parquet select * from src;
select * from test_parquet;
get results like the following

...
282 [B@38fda3b
138 [B@1407a24
238 [B@12de6fb
419 [B@6c97695
15 [B@4885067
118 [B@156a8d3
72 [B@65d20dd
90 [B@4c18906
307 [B@60b24cc
19 [B@59cf51b
435 [B@39fdf37
10 [B@4f799d7
277 [B@3950951
273 [B@596bf4b
306 [B@3e91557
224 [B@3781d61
309 [B@2d0d128
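
A possible interim workaround (an assumption on my part, not the fix in the 
linked PR) is to ask Spark SQL to decode Parquet BINARY columns as strings via 
the spark.sql.parquet.binaryAsString flag; whether that flag also covers the 
Hive-managed table path used above is untested.

{code:scala}
import org.apache.spark.sql.hive.HiveContext

// In spark-shell, where `sc` is already defined.
val hive = new HiveContext(sc)
hive.setConf("spark.sql.parquet.binaryAsString", "true")
hive.sql("SELECT * FROM test_parquet").collect().foreach(println)
{code}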



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException

2014-11-22 Thread wangfei (JIRA)
wangfei created SPARK-4552:
--

 Summary: query for empty parquet table in spark sql hive get 
IllegalArgumentException
 Key: SPARK-4552
 URL: https://issues.apache.org/jira/browse/SPARK-4552
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
 Fix For: 1.2.0


run
create table test_parquet(key int, value string) stored as parquet;
select * from test_parquet;
get an error like the following

java.lang.IllegalArgumentException: Could not find Parquet metadata at path 
file:/user/hive/warehouse/test_parquet
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org