[jira] [Comment Edited] (SPARK-1503) Implement Nesterov's accelerated first-order method
[ https://issues.apache.org/jira/browse/SPARK-1503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221233#comment-14221233 ] Aaron Staple edited comment on SPARK-1503 at 11/23/14 6:55 AM:
---
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything.
UPDATE: Ok, here's the document: https://docs.google.com/document/d/1L50O66LnBfVopFjptbet2ZTQRzriZTjKvlIILZwKsno/edit?usp=sharing

was (Author: staple):
[~mengxr] Sorry for the delay. I wrote up a design proposal for the initial implementation. Let me know what you think, and if you'd like me to clarify anything.
UPDATE: On second thought, I'd actually like to make a few changes to the proposal. I'll follow up tomorrow with the updated version. Sorry for the confusion.

> Implement Nesterov's accelerated first-order method
> ---
>
> Key: SPARK-1503
> URL: https://issues.apache.org/jira/browse/SPARK-1503
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Aaron Staple
>
> Nesterov's accelerated first-order method is a drop-in replacement for
> steepest descent, but it converges much faster. We should implement this
> method and compare its performance with existing algorithms, including SGD
> and L-BFGS.
> TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's
> method and its variants on composite objectives.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
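As background on the issue above: Nesterov's method augments a plain gradient step with a momentum extrapolation, improving the convergence rate from O(1/k) to O(1/k^2) on smooth convex problems. A minimal standalone sketch in plain Python (the function and parameter names here are illustrative, not the proposed MLlib API):

```python
# Sketch of Nesterov's accelerated gradient method for a smooth convex
# objective with gradient grad_f. Illustrative helper names only; not
# the API proposed for MLlib.
def nesterov(grad_f, x0, step, iters):
    x = x0          # current iterate
    y = x0          # extrapolated ("lookahead") point
    t = 1.0         # momentum sequence t_k
    for _ in range(iters):
        x_next = y - step * grad_f(y)                       # gradient step at y
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0   # t_{k+1}
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)    # momentum extrapolation
        x, t = x_next, t_next
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star = nesterov(lambda x: 2.0 * (x - 3.0), x0=0.0, step=0.4, iters=200)
```

The same update structure carries over to composite objectives (as in TFOCS) by replacing the gradient step with a proximal step.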
[jira] [Commented] (SPARK-4417) New API: sample RDD to fixed number of items
[ https://issues.apache.org/jira/browse/SPARK-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222332#comment-14222332 ] Sandeep Singh commented on SPARK-4417:
--
Can you assign this to me?

> New API: sample RDD to fixed number of items
>
> Key: SPARK-4417
> URL: https://issues.apache.org/jira/browse/SPARK-4417
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, Spark Core
> Reporter: Davies Liu
>
> Sometimes we just want a fixed number of items randomly selected from an
> RDD; for example, before sorting an RDD we need to gather a fixed number of
> keys from each partition.
> Doing this today takes two passes over the RDD: get the total count, then
> calculate the right ratio for sampling. In fact, we could do this in one
> pass.
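One-pass fixed-size sampling of the kind proposed here is commonly done with reservoir sampling. A minimal sketch over a plain Python iterable (illustrative only, not Spark's RDD API; Spark would run something like this per partition and then combine the reservoirs):

```python
import random

def reservoir_sample(items, k, seed=None):
    """Uniformly sample k items from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)   # fill the reservoir with the first k items
        else:
            # Keep the new item with probability k / (i + 1), replacing a
            # uniformly chosen slot; this preserves uniformity over all items.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), 10, seed=42)
```

Because each input element is visited exactly once, no prior count of the data is needed.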
[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
[ https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4561:
--
Target Version/s: 1.3.0 (was: 1.2.0)

Good point; if we add a {{recursive}} option and have recursion off by default, then it's not urgent to fix this now, since the new option will be backwards-compatible with what we ship in 1.2.0.

> PySparkSQL's Row.asDict() should convert nested rows to dictionaries
>
> Key: SPARK-4561
> URL: https://issues.apache.org/jira/browse/SPARK-4561
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Affects Versions: 1.2.0
> Reporter: Josh Rosen
> Assignee: Davies Liu
>
> In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it
> to a dictionary. Unfortunately, though, this does not convert nested rows to
> dictionaries. For example:
> {code}
> >>> sqlContext.sql("select results from results").first()
> Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559),
> Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166),
> Row(time=3.276), Row(time=3.239), Row(time=3.149)])
> >>> sqlContext.sql("select results from results").first().asDict()
> {u'results': [(3.762,),
> (3.47,),
> (3.559,),
> (3.458,),
> (3.229,),
> (3.21,),
> (3.166,),
> (3.276,),
> (3.239,),
> (3.149,)]}
> {code}
> Actually, it looks like the nested fields are just left as Rows (IPython's
> fancy display logic obscured this in my first example):
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [Row(time=1), Row(time=2)]}
> {code}
> Here's the output I'd expect:
> {code}
> >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
> {'results': [{'time': 1}, {'time': 2}]}
> {code}
> I ran into this issue when trying to use Pandas dataframes to display nested
> data that I queried from Spark SQL.
[jira] [Commented] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
[ https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222310#comment-14222310 ] Davies Liu commented on SPARK-4561:
---
I tried to do this, but found that it's not easy, because Row() can be nested inside MapType and ArrayType (even a UDT), and it could also be expensive. Maybe we should make it optional, via recursive=True?
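The recursive=True idea floated in the comment above could look roughly like the following sketch, using namedtuple as a stand-in for PySpark's Row (the names and structure are assumptions for illustration, not the actual PySpark implementation):

```python
from collections import namedtuple

# namedtuple serves here as a stand-in for PySpark's Row: both expose their
# fields and support an _asdict()-style conversion. This is only a sketch of
# the recursive-conversion idea, not the real Row.asDict() code.
def as_dict(row, recursive=True):
    def conv(obj):
        if hasattr(obj, "_asdict"):           # a nested row
            return {k: conv(v) for k, v in obj._asdict().items()}
        if isinstance(obj, list):             # rows inside an ArrayType value
            return [conv(v) for v in obj]
        if isinstance(obj, dict):             # rows inside a MapType value
            return {k: conv(v) for k, v in obj.items()}
        return obj                            # plain scalar: leave as-is
    return conv(row) if recursive else dict(row._asdict())

Row = namedtuple("Row", ["results"])
Inner = namedtuple("Inner", ["time"])
d = as_dict(Row(results=[Inner(time=1), Inner(time=2)]))
```

With recursive=False the top-level fields are copied without walking into nested values, which keeps the cheap default behavior Davies suggests.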
[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
[ https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4561:
--
Target Version/s: 1.2.0
Assignee: Davies Liu

[~davies], could you take a look at this, since you're more familiar with this code than I am? It might be nice to squeeze a fix for this into 1.2.0 before this API becomes stable. I noticed that there are two {{asDict()}} methods, one in each {{Row}} class; is there a way to avoid this duplication? Also, could we maybe add some user-facing doctests to this, e.g.
{code}
def asDict(self):
    """ Return this row as a dictionary.

    >>> Row(name='Alice', age=11).asDict()
    {'age': 11, 'name': 'Alice'}

    Nested rows will be converted into nested dictionaries:

    >>> Row(results=[Row(time=1), Row(time=2)]).asDict()
    {'results': [{'time': 1}, {'time': 2}]}
    """
{code}
[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
[ https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4561:
--
Description:
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a dictionary. Unfortunately, though, this does not convert nested rows to dictionaries. For example:
{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,), (3.47,), (3.559,), (3.458,), (3.229,), (3.21,), (3.166,), (3.276,), (3.239,), (3.149,)]}
{code}
Actually, it looks like the nested fields are just left as Rows (IPython's fancy display logic obscured this in my first example):
{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}
Here's the output I'd expect:
{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [{'time': 1}, {'time': 2}]}
{code}
I ran into this issue when trying to use Pandas dataframes to display nested data that I queried from Spark SQL.

was:
In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a dictionary. Unfortunately, though, this does not convert nested rows to dictionaries.
For example:
{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,), (3.47,), (3.559,), (3.458,), (3.229,), (3.21,), (3.166,), (3.276,), (3.239,), (3.149,)]}
{code}
Actually, it looks like the nested fields are just left as Rows (IPython's fancy display logic obscured this in my first example):
{code}
>>> Row(results=[Row(time=1), Row(time=2)]).asDict()
{'results': [Row(time=1), Row(time=2)]}
{code}
I ran into this issue when trying to use Pandas dataframes to display nested data that I queried from Spark SQL.
[jira] [Updated] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
[ https://issues.apache.org/jira/browse/SPARK-4561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4561: -- Description: In PySpark, you can call {{.asDict ()}} on a SparkSQL {{Row}} to convert it to a dictionary. Unfortunately, though, this does not convert nested rows to dictionaries. For example: {code} >>> sqlContext.sql("select results from results").first() Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), Row(time=3.239), Row(time=3.149)]) >>> sqlContext.sql("select results from results").first().asDict() {u'results': [(3.762,), (3.47,), (3.559,), (3.458,), (3.229,), (3.21,), (3.166,), (3.276,), (3.239,), (3.149,)]} {code} Actually, it looks like the nested fields are just left as Rows (IPython's fancy display logic obscured this in my first example): {code} >>> Row(results=[Row(time=1), Row(time=2)]).asDict() {'results': [Row(time=1), Row(time=2)]} {code} I ran into this issue when trying to use Pandas dataframes to display nested data that I queried from Spark SQL. was: In PySpark, you can call {{.asDict ()}} on a SparkSQL {{Row}} to convert it to a dictionary. Unfortunately, though, this does not convert nested rows to dictionaries. For example: {code} >>> sqlContext.sql("select results from results").first() Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), Row(time=3.239), Row(time=3.149)]) >>> sqlContext.sql("select results from results").first().asDict() {u'results': [(3.762,), (3.47,), (3.559,), (3.458,), (3.229,), (3.21,), (3.166,), (3.276,), (3.239,), (3.149,)]} {code} I ran into this issue when trying to use Pandas dataframes to display nested data that I queried from Spark SQL. 
[jira] [Created] (SPARK-4561) PySparkSQL's Row.asDict() should convert nested rows to dictionaries
Josh Rosen created SPARK-4561:
-
Summary: PySparkSQL's Row.asDict() should convert nested rows to dictionaries
Key: SPARK-4561
URL: https://issues.apache.org/jira/browse/SPARK-4561
Project: Spark
Issue Type: Improvement
Components: PySpark, SQL
Affects Versions: 1.2.0
Reporter: Josh Rosen

In PySpark, you can call {{.asDict()}} on a SparkSQL {{Row}} to convert it to a dictionary. Unfortunately, though, this does not convert nested rows to dictionaries. For example:
{code}
>>> sqlContext.sql("select results from results").first()
Row(results=[Row(time=3.762), Row(time=3.47), Row(time=3.559), Row(time=3.458), Row(time=3.229), Row(time=3.21), Row(time=3.166), Row(time=3.276), Row(time=3.239), Row(time=3.149)])
>>> sqlContext.sql("select results from results").first().asDict()
{u'results': [(3.762,), (3.47,), (3.559,), (3.458,), (3.229,), (3.21,), (3.166,), (3.276,), (3.239,), (3.149,)]}
{code}
I ran into this issue when trying to use Pandas dataframes to display nested data that I queried from Spark SQL.
[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.
[ https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4377: --- Affects Version/s: (was: 1.2.0) 1.3.0 > ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to > deserialize a serialized ActorRef without an ActorSystem in scope. > - > > Key: SPARK-4377 > URL: https://issues.apache.org/jira/browse/SPARK-4377 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.3.0 >Reporter: Josh Rosen >Assignee: Prashant Sharma >Priority: Blocker > Fix For: 1.3.0 > > > It looks like ZooKeeperPersistenceEngine is broken in the current Spark > master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481). Here's a log excerpt from > a secondary master when it takes over from a failed primary master: > {code} > 14/11/13 04:37:12 WARN ConnectionStateManager: There are no > ConnectionStateListeners registered. > 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership > 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading > persisted file, deleting > java.io.IOException: 
java.lang.IllegalStateException: Trying to deserialize a > serialized ActorRef without an ActorSystem in scope. Use > 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }' > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988) > at > org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > 
at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32) > at > org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.
[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.
[ https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4377:
---
Target Version/s: (was: 1.2.0)
[jira] [Resolved] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.
[ https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4377.
---
Resolution: Fixed
[jira] [Commented] (SPARK-4560) Lambda deserialization error
[ https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1482#comment-1482 ] Alexis Seigneurin commented on SPARK-4560: -- It looks like the foreach() method is causing an issue. If I replace it with a call to count(), it works fine: {code} TwitterUtils.createStream(sc, twitterAuth, filters) .map(t -> t.getText()) .foreachRDD(tweets -> { System.out.println(tweets.count()); return null; }); {code} > Lambda deserialization error > > > Key: SPARK-4560 > URL: https://issues.apache.org/jira/browse/SPARK-4560 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 > Environment: Java 8.0.25 >Reporter: Alexis Seigneurin > Attachments: IndexTweets.java, pom.xml > > > I'm getting an error saying a lambda could not be deserialized. Here is the > code: > {code} > TwitterUtils.createStream(sc, twitterAuth, filters) > .map(t -> t.getText()) > .foreachRDD(tweets -> { > tweets.foreach(x -> System.out.println(x)); > return null; > }); > {code} > Here is the exception: > {noformat} > java.io.IOException: unexpected exception type > at > java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) > at > java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104) > ... 27 more > Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization > at > com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1) > ... 37 more > {noformat} > T
[jira] [Updated] (SPARK-4560) Lambda deserialization error
[ https://issues.apache.org/jira/browse/SPARK-4560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexis Seigneurin updated SPARK-4560: - Attachment: IndexTweets.java pom.xml I'm attaching the class I'm using and Maven's pom.xml file so that you can reproduce the issue. > Lambda deserialization error > > > Key: SPARK-4560 > URL: https://issues.apache.org/jira/browse/SPARK-4560 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 > Environment: Java 8.0.25 >Reporter: Alexis Seigneurin > Attachments: IndexTweets.java, pom.xml > > > I'm getting an error saying a lambda could not be deserialized. Here is the > code: > {code} > TwitterUtils.createStream(sc, twitterAuth, filters) > .map(t -> t.getText()) > .foreachRDD(tweets -> { > tweets.foreach(x -> System.out.println(x)); > return null; > }); > {code} > Here is the exception: > {noformat} > java.io.IOException: unexpected exception type > at > java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) > at > java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > 
java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104) > ... 27 more > Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization > at > com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1) > ... 37 more > {noformat} > The weird thing is, if I write the following code (the map operation is > inside the foreachRDD), it works without problem. > {code} > TwitterUtils.createStream(sc, twitterAuth, filters) > .foreachRDD(tweets -> { > tweets.map(t -> t.g
[jira] [Created] (SPARK-4560) Lambda deserialization error
Alexis Seigneurin created SPARK-4560: Summary: Lambda deserialization error Key: SPARK-4560 URL: https://issues.apache.org/jira/browse/SPARK-4560 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Environment: Java 8.0.25 Reporter: Alexis Seigneurin I'm getting an error saying a lambda could not be deserialized. Here is the code: {code} TwitterUtils.createStream(sc, twitterAuth, filters) .map(t -> t.getText()) .foreachRDD(tweets -> { tweets.foreach(x -> System.out.println(x)); return null; }); {code} Here is the exception: {noformat} java.io.IOException: unexpected exception type at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1110) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1810) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918) at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:57) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.lang.invoke.SerializedLambda.readResolve(SerializedLambda.java:230) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1104) ... 27 more Caused by: java.lang.IllegalArgumentException: Invalid lambda deserialization at com.seigneurin.spark.IndexTweets.$deserializeLambda$(IndexTweets.java:1) ... 37 more {noformat} The weird thing is, if I write the following code (the map operation is inside the foreachRDD), it works without problem. 
{code} TwitterUtils.createStream(sc, twitterAuth, filters) .foreachRDD(tweets -> { tweets.map(t -> t.getText()) .foreach(x -> System.out.println(x)); return null; }); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
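For readers hitting the same trace: the failure surfaces in {{SerializedLambda.readResolve}}, which is plain Java 8 serialization machinery, not Spark code. A lambda only survives Java serialization if its target interface is Serializable, and deserialization calls back into the capturing class's synthetic {{$deserializeLambda$}} method. A Spark-free sketch of the happy path (class and method names here are illustrative, not from the ticket):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

public class LambdaSerDemo {

    // A lambda is only serializable if its target interface is Serializable.
    public interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    // Serialize an object to bytes and read it back, roughly what Spark's
    // JavaSerializer does when shipping closures to executors.
    @SuppressWarnings("unchecked")
    public static <T> T roundTrip(T obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (T) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SerFunction<String, String> getText = s -> s.toUpperCase();
        // Deserialization invokes the capturing class's synthetic
        // $deserializeLambda$ method; within the same JVM this succeeds.
        SerFunction<String, String> copy = roundTrip(getText);
        System.out.println(copy.apply("tweet")); // prints TWEET
    }
}
```

In the ticket, that same readResolve step runs on an executor; "Invalid lambda deserialization" means {{$deserializeLambda$}} in the shipped class could not match the serialized lambda back to a method, which is presumably why restructuring the lambdas (as in the working variant above) changes the outcome.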
[jira] [Commented] (SPARK-4517) Improve memory efficiency for python broadcast
[ https://issues.apache.org/jira/browse/SPARK-4517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1475#comment-1475 ] Apache Spark commented on SPARK-4517: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3417 > Improve memory efficiency for python broadcast > -- > > Key: SPARK-4517 > URL: https://issues.apache.org/jira/browse/SPARK-4517 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Currently, the Python broadcast (TorrentBroadcast) will have multiple copies > in: > 1) 1 copy in the python driver > 2) 1 copy on the driver's disk (serialized and compressed) > 3) 2 copies in the JVM driver (one unserialized, one serialized and > compressed) > 4) 2 copies in each executor (one unserialized, one serialized and > compressed) > 5) one copy in each python worker. > Some of them are different in HTTPBroadcast: > 3) one copy in memory of the driver, one copy on disk (serialized and compressed) > 4) one copy in memory of each executor > If the python broadcast is 4G, then it needs 12G in the driver, and 8+4x G in each > executor (x is the number of python workers, usually the number of CPUs). > The Python broadcast is already serialized and compressed in Python; it > should not be serialized and compressed again in the JVM. Also, the JVM does not need > to know its content, so it could be kept out of the JVM. > So, we should have a specialized broadcast implementation for Python: it stores > the serialized and compressed data on disk, transfers it to executors in a p2p > way (similar to TorrentBroadcast), and sends it to python workers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4518) Filestream sometimes processes files twice
[ https://issues.apache.org/jira/browse/SPARK-4518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1473#comment-1473 ] Apache Spark commented on SPARK-4518: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/3419 > Filestream sometimes processes files twice > -- > > Key: SPARK-4518 > URL: https://issues.apache.org/jira/browse/SPARK-4518 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2, 1.1.1 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4519) Filestream does not use hadoop configuration set within sparkContext.hadoopConfiguration
[ https://issues.apache.org/jira/browse/SPARK-4519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1474#comment-1474 ] Apache Spark commented on SPARK-4519: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/3419 > Filestream does not use hadoop configuration set within > sparkContext.hadoopConfiguration > > > Key: SPARK-4519 > URL: https://issues.apache.org/jira/browse/SPARK-4519 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2, 1.1.1 >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4559) Adding support for ucase and lcase
[ https://issues.apache.org/jira/browse/SPARK-4559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1471#comment-1471 ] Apache Spark commented on SPARK-4559: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3418 > Adding support for ucase and lcase > -- > > Key: SPARK-4559 > URL: https://issues.apache.org/jira/browse/SPARK-4559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > Fix For: 1.2.0 > > > Adding support for ucase and lcase in spark sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4559) Adding support for ucase and lcase
wangfei created SPARK-4559: -- Summary: Adding support for ucase and lcase Key: SPARK-4559 URL: https://issues.apache.org/jira/browse/SPARK-4559 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 Adding support for ucase and lcase in spark sql -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4489) JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-4489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1462#comment-1462 ] Josh Rosen commented on SPARK-4489: --- It looks like this is still a legitimate issue; the underlying bug is due to the Java API's handling of ClassTags plus incomplete test coverage for the Java API. Regarding the ClassTag workaround in the gist, I think that you might be able to use the {{retag()}} method that I added in the fix to SPARK-1040 to quickly fix this. I may be able to take a look at this reproduction later, but I'm going to leave this unassigned for now since it would be a great starter task for someone to pick up. > JavaPairRDD.collectAsMap from checkpoint RDD may fail with ClassCastException > - > > Key: SPARK-4489 > URL: https://issues.apache.org/jira/browse/SPARK-4489 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.1.0 >Reporter: Christopher Ng > > Calling collectAsMap() on a JavaPairRDD reconstructed from a checkpoint fails > with a ClassCastException: > Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; > cannot be cast to [Lscala.Tuple2; > at > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:595) > at > org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:569) > at org.facboy.spark.CheckpointBug.main(CheckpointBug.java:46) > Code sample reproducing the issue: > https://gist.github.com/facboy/8387e950ffb0746a8272 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
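The root-cause pattern in that ClassCastException is visible without Spark: an array created through an erased ClassTag has runtime type Object[], and Java's checked array cast then fails exactly like the [Ljava.lang.Object; to [Lscala.Tuple2; cast in collectAsMap. A minimal illustration (names are mine, not from the reproduction gist):

```java
public class ArrayCastDemo {
    public static void main(String[] args) {
        // With an erased element type, the collected array is built as
        // Object[] even though every element is (say) a String.
        Object[] collected = new Object[] { "k=v" };

        try {
            // Same shape as collectAsMap's cast to Tuple2[]: the runtime
            // type of the array itself is checked, not the element types.
            String[] typed = (String[]) (Object) collected;
            System.out.println(typed.length); // never reached
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
    }
}
```

Rebuilding the array with the right runtime element type (which is what a {{retag()}}-style fix does) avoids the failing cast entirely.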
[jira] [Updated] (SPARK-4530) GradientDescent gets a wrong gradient value according to the gradient formula, which is caused by the miniBatchSize parameter.
[ https://issues.apache.org/jira/browse/SPARK-4530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4530: - Priority: Major (was: Blocker) See comments on the PR. I don't think these things rise to the level of 'blocker'. > GradientDescent gets a wrong gradient value according to the gradient formula, > which is caused by the miniBatchSize parameter. > - > > Key: SPARK-4530 > URL: https://issues.apache.org/jira/browse/SPARK-4530 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.0.0, 1.1.0, 1.2.0 >Reporter: Guoqiang Li > > This bug is caused by {{RDD.sample}}: the number of items {{RDD.sample}} returns > is not fixed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
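For context on the {{RDD.sample}} remark: without replacement, sampling keeps each element independently with the given probability, so the returned count is binomially distributed rather than exactly miniBatchFraction * n, and dividing a summed gradient by the expected size instead of the actual size skews the result. A standalone simulation of that behavior (plain Java, no Spark; the constants are arbitrary):

```java
import java.util.Random;

public class BernoulliSampleDemo {

    // Keep each of n elements independently with probability p, the same
    // scheme Bernoulli sampling in RDD.sample uses per partition.
    static int sampleCount(int n, double p, long seed) {
        Random rng = new Random(seed);
        int kept = 0;
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < p) kept++;
        }
        return kept;
    }

    public static void main(String[] args) {
        int n = 10_000;
        double p = 0.1;
        // The expected size is n * p = 1000, but each run drifts around it,
        // so a fixed denominator of miniBatchFraction * n rarely matches
        // the actual number of sampled points.
        for (long seed = 1; seed <= 3; seed++) {
            System.out.println("seed " + seed + ": " + sampleCount(n, p, seed));
        }
    }
}
```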
[jira] [Created] (SPARK-4558) History Server waits ~10s before starting up
Andrew Or created SPARK-4558: Summary: History Server waits ~10s before starting up Key: SPARK-4558 URL: https://issues.apache.org/jira/browse/SPARK-4558 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor After you call `sbin/start-history-server.sh`, it waits about 10s before actually starting up. I suspect this is a subtle bug related to log checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4490) Not found RandomGenerator through spark-shell
[ https://issues.apache.org/jira/browse/SPARK-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1430#comment-1430 ] Sean Owen commented on SPARK-4490: -- commons-math3 is still a dependency of core, yes. Are you saying this works with HEAD? That would make more sense, but in general I think you still would want to explicitly add breeze and commons-math3 to the classpath if you want to use them in spark-shell rather than rely on them being in the assembly. > Not found RandomGenerator through spark-shell > - > > Key: SPARK-4490 > URL: https://issues.apache.org/jira/browse/SPARK-4490 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 > Environment: spark-shell >Reporter: Kai Sasaki > > In spark-1.1.0, an exception is thrown whenever RandomGenerator from commons-math3 > is used. There is a workaround for this problem. > http://find.searchhub.org/document/6df0c89201dfe386#10bd443c4da849a3 > ``` > scala> import breeze.linalg._ > import breeze.linalg._ > scala> Matrix.rand[Double](3, 3) > java.lang.NoClassDefFoundError: > org/apache/commons/math3/random/RandomGenerator > at > breeze.linalg.MatrixConstructors$class.rand$default$3(Matrix.scala:205) > at breeze.linalg.Matrix$.rand$default$3(Matrix.scala:139) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:14) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:19) > at $iwC$$iwC$$iwC$$iwC.(:21) > at $iwC$$iwC$$iwC.(:23) > at $iwC$$iwC.(:25) > at $iwC.(:27) > at (:29) > at .(:33) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) > at > 
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) > at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) > at > org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) > at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) > at > org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.commons.math3.random.RandomGenerator > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at 
java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 44 more > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) ---
[jira] [Updated] (SPARK-4557) Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>
[ https://issues.apache.org/jira/browse/SPARK-4557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4557: - Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) (Don't think this is a bug, really.) Yes, it's possible VoidFunction didn't exist when this API was defined. It can't be changed now without breaking API compatibility but AFAICT VoidFunction would be more appropriate. Maybe this can happen with some other related Java API rationalization in Spark 2.x. > Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a > Function<..., Void> > --- > > Key: SPARK-4557 > URL: https://issues.apache.org/jira/browse/SPARK-4557 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Alexis Seigneurin >Priority: Minor > > In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You > have to write: > {code:java} > .foreachRDD(items -> { > ...; > return null; > }); > {code} > Instead of: > {code:java} > .foreachRDD(items -> ...); > {code} > This is because the foreachRDD method accepts a Function<..., Void> > instead of a VoidFunction<...>. It would make sense to change it > to a VoidFunction as, in Spark's API, the foreach method already accepts a > VoidFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
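The two signatures are easy to compare with simplified stand-ins for the interfaces in org.apache.spark.api.java.function (local sketches of their shape, not the real classes):

```java
public class VoidFunctionDemo {

    // Sketches of the two functional-interface shapes under discussion.
    interface Function<T, R> { R call(T t) throws Exception; }
    interface VoidFunction<T> { void call(T t) throws Exception; }

    // Current API shape: the caller's lambda must end with "return null".
    static <T> void foreachRDDCurrent(T rdd, Function<T, Void> f) throws Exception {
        f.call(rdd);
    }

    // Proposed shape: the lambda body is just the side effect.
    static <T> void foreachRDDProposed(T rdd, VoidFunction<T> f) throws Exception {
        f.call(rdd);
    }

    public static void main(String[] args) throws Exception {
        foreachRDDCurrent("items", items -> {
            System.out.println(items);
            return null; // boilerplate forced by Function<T, Void>
        });
        foreachRDDProposed("items", items -> System.out.println(items));
    }
}
```

With the VoidFunction shape, a void-returning lambda or expression matches directly, which is the whole improvement requested in the ticket.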
[jira] [Comment Edited] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1415#comment-1415 ] Patrick Wendell edited comment on SPARK-4556 at 11/22/14 10:17 PM: --- Checkout make-distribution.sh rather than using maven directly. We might consider removing that maven target since I don't think it's actively maintained. We should document clearly that make-distribution.sh is the way of building binaries. was (Author: pwendell): Checkout make-distribution.sh rather than using maven directly. We might consider removing that maven target since I don't think it's actively maintained. > binary distribution assembly can't run in local mode > > > Key: SPARK-4556 > URL: https://issues.apache.org/jira/browse/SPARK-4556 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Reporter: Sean Busbey > > After building the binary distribution assembly, the resultant tarball can't > be used for local mode. > {code} > busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package > [INFO] Scanning for projects... > ...SNIP... > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 > s] > [INFO] Spark Project Networking ... SUCCESS [ 31.402 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 > s] > [INFO] Spark Project Core . SUCCESS [15:39 > min] > [INFO] Spark Project Bagel SUCCESS [ 29.470 > s] > [INFO] Spark Project GraphX ... SUCCESS [05:20 > min] > [INFO] Spark Project Streaming SUCCESS [11:02 > min] > [INFO] Spark Project Catalyst . SUCCESS [11:26 > min] > [INFO] Spark Project SQL .. SUCCESS [11:33 > min] > [INFO] Spark Project ML Library ... SUCCESS [14:27 > min] > [INFO] Spark Project Tools SUCCESS [ 40.980 > s] > [INFO] Spark Project Hive . SUCCESS [11:45 > min] > [INFO] Spark Project REPL . SUCCESS [03:15 > min] > [INFO] Spark Project Assembly . SUCCESS [04:22 > min] > [INFO] Spark Project External Twitter . 
SUCCESS [ 43.567 > s] > [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 > s] > [INFO] Spark Project External Flume ... SUCCESS [01:41 > min] > [INFO] Spark Project External MQTT SUCCESS [ 40.973 > s] > [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 > s] > [INFO] Spark Project External Kafka ... SUCCESS [01:23 > min] > [INFO] Spark Project Examples . SUCCESS [10:19 > min] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 01:47 h > [INFO] Finished at: 2014-11-22T02:13:51-06:00 > [INFO] Final Memory: 79M/2759M > [INFO] > > busbey2-MBA:spark busbey$ cd assembly/target/ > busbey2-MBA:target busbey$ mkdir dist-temp > busbey2-MBA:target busbey$ tar -C dist-temp -xzf > spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz > busbey2-MBA:target busbey$ cd dist-temp/ > busbey2-MBA:dist-temp busbey$ ./bin/spark-shell > ls: > /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: > No such file or directory > Failed to find Spark assembly in > /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 > You need to build Spark before running this program. > {code} > It looks like the classpath calculations in {{bin/compute_classpath.sh}} > don't handle it. > If I move all of the spark-*.jar files from the top level into the lib folder > and touch the RELEASE file, then the spark shell launches in local mode > normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1415#comment-1415 ] Patrick Wendell commented on SPARK-4556: Checkout make-distribution.sh rather than using maven directly. We might consider removing that maven target since I don't think it's actively maintained. > binary distribution assembly can't run in local mode > > > Key: SPARK-4556 > URL: https://issues.apache.org/jira/browse/SPARK-4556 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Reporter: Sean Busbey > > After building the binary distribution assembly, the resultant tarball can't > be used for local mode. > {code} > busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package > [INFO] Scanning for projects... > ...SNIP... > [INFO] > > [INFO] Reactor Summary: > [INFO] > [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 > s] > [INFO] Spark Project Networking ... SUCCESS [ 31.402 > s] > [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 > s] > [INFO] Spark Project Core . SUCCESS [15:39 > min] > [INFO] Spark Project Bagel SUCCESS [ 29.470 > s] > [INFO] Spark Project GraphX ... SUCCESS [05:20 > min] > [INFO] Spark Project Streaming SUCCESS [11:02 > min] > [INFO] Spark Project Catalyst . SUCCESS [11:26 > min] > [INFO] Spark Project SQL .. SUCCESS [11:33 > min] > [INFO] Spark Project ML Library ... SUCCESS [14:27 > min] > [INFO] Spark Project Tools SUCCESS [ 40.980 > s] > [INFO] Spark Project Hive . SUCCESS [11:45 > min] > [INFO] Spark Project REPL . SUCCESS [03:15 > min] > [INFO] Spark Project Assembly . SUCCESS [04:22 > min] > [INFO] Spark Project External Twitter . SUCCESS [ 43.567 > s] > [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 > s] > [INFO] Spark Project External Flume ... SUCCESS [01:41 > min] > [INFO] Spark Project External MQTT SUCCESS [ 40.973 > s] > [INFO] Spark Project External ZeroMQ .. 
SUCCESS [ 54.878 > s] > [INFO] Spark Project External Kafka ... SUCCESS [01:23 > min] > [INFO] Spark Project Examples . SUCCESS [10:19 > min] > [INFO] > > [INFO] BUILD SUCCESS > [INFO] > > [INFO] Total time: 01:47 h > [INFO] Finished at: 2014-11-22T02:13:51-06:00 > [INFO] Final Memory: 79M/2759M > [INFO] > > busbey2-MBA:spark busbey$ cd assembly/target/ > busbey2-MBA:target busbey$ mkdir dist-temp > busbey2-MBA:target busbey$ tar -C dist-temp -xzf > spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz > busbey2-MBA:target busbey$ cd dist-temp/ > busbey2-MBA:dist-temp busbey$ ./bin/spark-shell > ls: > /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: > No such file or directory > Failed to find Spark assembly in > /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 > You need to build Spark before running this program. > {code} > It looks like the classpath calculations in {{bin/compute_classpath.sh}} > don't handle it. > If I move all of the spark-*.jar files from the top level into the lib folder > and touch the RELEASE file, then the spark shell launches in local mode > normally. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1410#comment-1410 ] Sean Busbey commented on SPARK-4556: Well, why does the layout of the binary distribution differ from the layout in a release? At a minimum the README should be updated to clarify the purpose of the binary distribution. Preferably, the README should include instructions for taking the binary distribution and deploying it to be runnable.
[jira] [Commented] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1406#comment-1406 ] Sean Owen commented on SPARK-4556: -- Hm, but is that a bug? I think compute-classpath.sh is designed to support running from the project root in development, or running from the files as laid out in the release, at least judging from your comments and the script itself. I don't think the raw contents of the assembly JAR themselves are a runnable installation.
[jira] [Updated] (SPARK-4377) ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to deserialize a serialized ActorRef without an ActorSystem in scope.
[ https://issues.apache.org/jira/browse/SPARK-4377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4377: --- Fix Version/s: 1.3.0 > ZooKeeperPersistenceEngine: java.lang.IllegalStateException: Trying to > deserialize a serialized ActorRef without an ActorSystem in scope. > - > > Key: SPARK-4377 > URL: https://issues.apache.org/jira/browse/SPARK-4377 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 1.2.0 >Reporter: Josh Rosen >Assignee: Prashant Sharma >Priority: Blocker > Fix For: 1.3.0 > > > It looks like ZooKeeperPersistenceEngine is broken in the current Spark > master (23f5bdf06a388e08ea5a69e848f0ecd5165aa481). Here's a log excerpt from > a secondary master when it takes over from a failed primary master: > {code} > 14/11/13 04:37:12 WARN ConnectionStateManager: There are no > ConnectionStateListeners registered. > 14/11/13 04:37:19 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:20 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:35 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:43 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:47 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:51 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.223: with 8 > cores, 984.0 MB RAM > 14/11/13 04:37:59 INFO Master: Registering worker 172.17.0.224: with 8 > cores, 984.0 MB RAM > 14/11/13 04:38:06 INFO ZooKeeperLeaderElectionAgent: We have gained leadership > 14/11/13 04:38:06 WARN ZooKeeperPersistenceEngine: Exception while reading > persisted file, deleting > java.io.IOException: 
java.lang.IllegalStateException: Trying to deserialize a > serialized ActorRef without an ActorSystem in scope. Use > 'akka.serialization.Serialization.currentSystem.withValue(system) { ... }' > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:988) > at > org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:51) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > at > org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:81) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.deserializeFromFile(ZooKeeperPersistenceEngine.scala:69) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$1.apply(ZooKeeperPersistenceEngine.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > 
at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:54) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:32) > at > org.apache.spark.deploy.master.PersistenceEngine$class.readPersistedData(PersistenceEngine.scala:84) > at > org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.readPersistedData(ZooKeeperPersistenceEngine
[jira] [Created] (SPARK-4557) Spark Streaming' foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void>
Alexis Seigneurin created SPARK-4557: Summary: Spark Streaming's foreachRDD method should accept a VoidFunction<...>, not a Function<..., Void> Key: SPARK-4557 URL: https://issues.apache.org/jira/browse/SPARK-4557 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Alexis Seigneurin In *Java*, using Spark Streaming's foreachRDD function is quite verbose. You have to write: {code:java} .foreachRDD(items -> { ...; return null; }); {code} Instead of: {code:java} .foreachRDD(items -> ...); {code} This is because the foreachRDD method accepts a Function<..., Void> instead of a VoidFunction<...>. It would make sense to change it to a VoidFunction as, in Spark's API, the foreach method already accepts a VoidFunction.
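The verbosity gap described above can be demonstrated with simplified stand-in interfaces (the `Function`/`VoidFunction` types below are illustrative only, not Spark's actual org.apache.spark.api.java.function classes):

```java
import java.util.ArrayList;
import java.util.List;

public class ForeachRddVerbosity {
    // Simplified stand-ins for illustration; NOT Spark's real interfaces.
    interface Function<T, R> { R call(T t); }
    interface VoidFunction<T> { void call(T t); }

    // Mimics the current signature: the callback must return a value.
    static <T> void foreachRddOld(List<T> rdd, Function<List<T>, Void> f) { f.call(rdd); }

    // Mimics the proposed signature: the callback is a statement lambda.
    static <T> void foreachRddNew(List<T> rdd, VoidFunction<List<T>> f) { f.call(rdd); }

    static List<String> runBoth() {
        List<String> seen = new ArrayList<>();
        List<String> batch = List.of("a", "b");
        // With Function<..., Void>, the lambda needs a block body and an
        // explicit "return null;":
        foreachRddOld(batch, items -> { seen.addAll(items); return null; });
        // With VoidFunction<...>, the same work is a one-expression lambda:
        foreachRddNew(batch, items -> seen.addAll(items));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(runBoth()); // [a, b, a, b]
    }
}
```

Both callbacks do identical work; only the `Void` return type forces the extra ceremony in the first lambda.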
[jira] [Updated] (SPARK-4556) binary distribution assembly can't run in local mode
[ https://issues.apache.org/jira/browse/SPARK-4556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Busbey updated SPARK-4556: --- Description: After building the binary distribution assembly, the resultant tarball can't be used for local mode. {code} busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package [INFO] Scanning for projects... ...SNIP... [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s] [INFO] Spark Project Networking ... SUCCESS [ 31.402 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 s] [INFO] Spark Project Core . SUCCESS [15:39 min] [INFO] Spark Project Bagel SUCCESS [ 29.470 s] [INFO] Spark Project GraphX ... SUCCESS [05:20 min] [INFO] Spark Project Streaming SUCCESS [11:02 min] [INFO] Spark Project Catalyst . SUCCESS [11:26 min] [INFO] Spark Project SQL .. SUCCESS [11:33 min] [INFO] Spark Project ML Library ... SUCCESS [14:27 min] [INFO] Spark Project Tools SUCCESS [ 40.980 s] [INFO] Spark Project Hive . SUCCESS [11:45 min] [INFO] Spark Project REPL . SUCCESS [03:15 min] [INFO] Spark Project Assembly . SUCCESS [04:22 min] [INFO] Spark Project External Twitter . SUCCESS [ 43.567 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s] [INFO] Spark Project External Flume ... SUCCESS [01:41 min] [INFO] Spark Project External MQTT SUCCESS [ 40.973 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s] [INFO] Spark Project External Kafka ... SUCCESS [01:23 min] [INFO] Spark Project Examples . 
SUCCESS [10:19 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:47 h [INFO] Finished at: 2014-11-22T02:13:51-06:00 [INFO] Final Memory: 79M/2759M [INFO] busbey2-MBA:spark busbey$ cd assembly/target/ busbey2-MBA:target busbey$ mkdir dist-temp busbey2-MBA:target busbey$ tar -C dist-temp -xzf spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz busbey2-MBA:target busbey$ cd dist-temp/ busbey2-MBA:dist-temp busbey$ ./bin/spark-shell ls: /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: No such file or directory Failed to find Spark assembly in /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 You need to build Spark before running this program. {code} It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't handle it. If I move all of the spark-*.jar files from the top level into the lib folder and touch the RELEASE file, then the spark shell launches in local mode normally. was: After building the binary distribution assembly, the resultant tarball can't be used for local mode. {code} busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package [INFO] Scanning for projects... ...SNIP... [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s] [INFO] Spark Project Networking ... SUCCESS [ 31.402 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 s] [INFO] Spark Project Core . SUCCESS [15:39 min] [INFO] Spark Project Bagel SUCCESS [ 29.470 s] [INFO] Spark Project GraphX ... SUCCESS [05:20 min] [INFO] Spark Project Streaming SUCCESS [11:02 min] [INFO] Spark Project Catalyst . SUCCESS [11:26 min] [INFO] Spark Project SQL .. SUCCESS [11:33 min] [INFO] Spark Project ML Library ... SUCCESS [14:27 min] [INFO] Spark Project Tools SUCCESS [ 40.980 s] [INFO] Spark Project Hive . SUCCESS [11:45 min] [INFO] Spark Project REPL . SUCCESS [03:15 min] [INFO] Spark Project Assembly . 
SUCCESS [04:22 min] [INFO] Spark Project External Twitter . SUCCESS [ 43.567 s] [INFO] Spark Project External Flume Sink .
[jira] [Created] (SPARK-4556) binary distribution assembly can't run in local mode
Sean Busbey created SPARK-4556: -- Summary: binary distribution assembly can't run in local mode Key: SPARK-4556 URL: https://issues.apache.org/jira/browse/SPARK-4556 Project: Spark Issue Type: Bug Components: Build, Spark Shell Reporter: Sean Busbey After building the binary distribution assembly, the resultant tarball can't be used for local mode. {code} busbey2-MBA:spark busbey$ mvn -Pbigtop-dist -DskipTests=true package [INFO] Scanning for projects... ...SNIP... [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ... SUCCESS [ 32.227 s] [INFO] Spark Project Networking ... SUCCESS [ 31.402 s] [INFO] Spark Project Shuffle Streaming Service SUCCESS [ 8.864 s] [INFO] Spark Project Core . SUCCESS [15:39 min] [INFO] Spark Project Bagel SUCCESS [ 29.470 s] [INFO] Spark Project GraphX ... SUCCESS [05:20 min] [INFO] Spark Project Streaming SUCCESS [11:02 min] [INFO] Spark Project Catalyst . SUCCESS [11:26 min] [INFO] Spark Project SQL .. SUCCESS [11:33 min] [INFO] Spark Project ML Library ... SUCCESS [14:27 min] [INFO] Spark Project Tools SUCCESS [ 40.980 s] [INFO] Spark Project Hive . SUCCESS [11:45 min] [INFO] Spark Project REPL . SUCCESS [03:15 min] [INFO] Spark Project Assembly . SUCCESS [04:22 min] [INFO] Spark Project External Twitter . SUCCESS [ 43.567 s] [INFO] Spark Project External Flume Sink .. SUCCESS [ 50.367 s] [INFO] Spark Project External Flume ... SUCCESS [01:41 min] [INFO] Spark Project External MQTT SUCCESS [ 40.973 s] [INFO] Spark Project External ZeroMQ .. SUCCESS [ 54.878 s] [INFO] Spark Project External Kafka ... SUCCESS [01:23 min] [INFO] Spark Project Examples . 
SUCCESS [10:19 min] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 01:47 h [INFO] Finished at: 2014-11-22T02:13:51-06:00 [INFO] Final Memory: 79M/2759M [INFO] busbey2-MBA:spark busbey$ cd assembly/target/ busbey2-MBA:target busbey$ mkdir dist-temp busbey2-MBA:target busbey$ tar -C dist-temp -xzf spark-assembly_2.10-1.3.0-SNAPSHOT-dist.tar.gz busbey2-MBA:target busbey$ cd dist-temp/ busbey2-MBA:dist-temp busbey$ ./bin/spark-shell ls: /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10: No such file or directory Failed to find Spark assembly in /Users/busbey/projects/spark/assembly/target/dist-temp/assembly/target/scala-2.10 You need to build Spark before running this program. {code} It looks like the classpath calculations in {{bin/compute_classpath.sh}} don't handle it. If I move all of the spark-*.jar files from the top level into the lib folder and touch the RELEASE file, then the spark shell launches in local mode normally.
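The manual workaround at the end of the report (move the top-level spark-*.jar files into lib/ and touch RELEASE) can be sketched programmatically; the temp directory below is a throwaway stand-in for the extracted tarball, not Spark build tooling:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class DistLayoutFixup {
    // Move top-level spark-*.jar files into lib/ and create an empty RELEASE
    // marker, so the classpath script treats the directory like a release.
    static void applyWorkaround(Path dist) {
        try {
            Path lib = Files.createDirectories(dist.resolve("lib"));
            try (DirectoryStream<Path> jars = Files.newDirectoryStream(dist, "spark-*.jar")) {
                for (Path jar : jars) {
                    Files.move(jar, lib.resolve(jar.getFileName()));
                }
            }
            Path release = dist.resolve("RELEASE");
            if (!Files.exists(release)) {
                Files.createFile(release); // equivalent of "touch RELEASE"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Build a throwaway directory mimicking the extracted tarball, apply the
    // workaround, and report whether the expected release layout now exists.
    static boolean demo() {
        try {
            Path dist = Files.createTempDirectory("dist-temp");
            Files.createFile(dist.resolve("spark-assembly_2.10-1.3.0-SNAPSHOT.jar"));
            applyWorkaround(dist);
            return Files.exists(dist.resolve("lib").resolve("spark-assembly_2.10-1.3.0-SNAPSHOT.jar"))
                && Files.exists(dist.resolve("RELEASE"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```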
[jira] [Updated] (SPARK-4507) PR merge script should support closing multiple JIRA tickets
[ https://issues.apache.org/jira/browse/SPARK-4507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4507: --- Labels: starter (was: ) > PR merge script should support closing multiple JIRA tickets > > > Key: SPARK-4507 > URL: https://issues.apache.org/jira/browse/SPARK-4507 > Project: Spark > Issue Type: Bug > Components: Project Infra >Reporter: Josh Rosen >Priority: Minor > Labels: starter > > For pull requests that reference multiple JIRAs in their titles, it would be > helpful if the PR merge script offered to close all of them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1517: --- Priority: Critical (was: Major) > Publish nightly snapshots of documentation, maven artifacts, and binary builds > -- > > Key: SPARK-1517 > URL: https://issues.apache.org/jira/browse/SPARK-1517 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Patrick Wendell >Priority: Critical > > Should be pretty easy to do with Jenkins. The only thing I can think of that > would be tricky is to set up credentials so that jenkins can publish this > stuff somewhere on apache infra. > Ideally we don't want to have to put a private key on every jenkins box > (since they are otherwise pretty stateless). One idea is to encrypt these > credentials with a passphrase and post them somewhere publicly visible. Then > the jenkins build can download the credentials provided we set a passphrase > in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
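The credentials scheme sketched in the description (encrypt with a passphrase, post the blob publicly, let Jenkins decrypt with a passphrase from an environment variable) can be prototyped with the JDK's built-in crypto. All parameter choices below (PBKDF2 iteration count, AES-GCM, salt/IV layout) are illustrative assumptions, not Spark or Jenkins infrastructure code:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.PBEKeySpec;
import javax.crypto.spec.SecretKeySpec;

public class PassphraseCrypto {
    private static final SecureRandom RNG = new SecureRandom();

    // Derive an AES key from the passphrase; slow KDF so the public blob
    // resists brute force.
    static SecretKeySpec deriveKey(char[] passphrase, byte[] salt) throws GeneralSecurityException {
        SecretKeyFactory f = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] key = f.generateSecret(new PBEKeySpec(passphrase, salt, 100_000, 256)).getEncoded();
        return new SecretKeySpec(key, "AES");
    }

    static byte[] encrypt(byte[] plain, char[] passphrase) throws GeneralSecurityException {
        byte[] salt = new byte[16], iv = new byte[12];
        RNG.nextBytes(salt);
        RNG.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, deriveKey(passphrase, salt), new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(plain);
        // Blob layout: salt || iv || ciphertext+tag, so it is self-describing.
        byte[] out = new byte[16 + 12 + ct.length];
        System.arraycopy(salt, 0, out, 0, 16);
        System.arraycopy(iv, 0, out, 16, 12);
        System.arraycopy(ct, 0, out, 28, ct.length);
        return out;
    }

    static byte[] decrypt(byte[] blob, char[] passphrase) throws GeneralSecurityException {
        byte[] salt = Arrays.copyOfRange(blob, 0, 16);
        byte[] iv = Arrays.copyOfRange(blob, 16, 28);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, deriveKey(passphrase, salt), new GCMParameterSpec(128, iv));
        return c.doFinal(Arrays.copyOfRange(blob, 28, blob.length));
    }

    // Round-trip helper: what the Jenkins job would do with the passphrase
    // it reads from an environment variable.
    static String roundTrip(String secret, String passphrase) {
        try {
            byte[] blob = encrypt(secret.getBytes(StandardCharsets.UTF_8), passphrase.toCharArray());
            return new String(decrypt(blob, passphrase.toCharArray()), StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("deploy-key-contents", "JENKINS_PASSPHRASE"));
    }
}
```

The encrypted blob can live anywhere publicly visible; only the passphrase (a single environment variable on the Jenkins side) must stay secret.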
[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1517: --- Target Version/s: 1.3.0 > Publish nightly snapshots of documentation, maven artifacts, and binary builds > -- > > Key: SPARK-1517 > URL: https://issues.apache.org/jira/browse/SPARK-1517 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Patrick Wendell > > Should be pretty easy to do with Jenkins. The only thing I can think of that > would be tricky is to set up credentials so that jenkins can publish this > stuff somewhere on apache infra. > Ideally we don't want to have to put a private key on every jenkins box > (since they are otherwise pretty stateless). One idea is to encrypt these > credentials with a passphrase and post them somewhere publicly visible. Then > the jenkins build can download the credentials provided we set a passphrase > in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4542) Post nightly releases
[ https://issues.apache.org/jira/browse/SPARK-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4542. Resolution: Duplicate > Post nightly releases > - > > Key: SPARK-4542 > URL: https://issues.apache.org/jira/browse/SPARK-4542 > Project: Spark > Issue Type: Improvement >Reporter: Arun Ahuja > > Spark developers are continually including new improvements and fixes to > sometimes critical issues. To speed up review and resolve issues faster, it > would help for multiple people to be able to test (or use those fixes if > they are critical) if there were 1) snapshots published to Maven and 2) full > distributions/scripts posted somewhere. Otherwise each individual developer > has to pull and rebuild, which may be a very long process.
[jira] [Updated] (SPARK-1517) Publish nightly snapshots of documentation, maven artifacts, and binary builds
[ https://issues.apache.org/jira/browse/SPARK-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1517: --- Fix Version/s: (was: 1.2.0) > Publish nightly snapshots of documentation, maven artifacts, and binary builds > -- > > Key: SPARK-1517 > URL: https://issues.apache.org/jira/browse/SPARK-1517 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Patrick Wendell > > Should be pretty easy to do with Jenkins. The only thing I can think of that > would be tricky is to set up credentials so that jenkins can publish this > stuff somewhere on apache infra. > Ideally we don't want to have to put a private key on every jenkins box > (since they are otherwise pretty stateless). One idea is to encrypt these > credentials with a passphrase and post them somewhere publicly visible. Then > the jenkins build can download the credentials provided we set a passphrase > in an environment variable in jenkins. There may be simpler solutions as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2143) Display Spark version on Driver web page
[ https://issues.apache.org/jira/browse/SPARK-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2143: --- Priority: Critical (was: Major) > Display Spark version on Driver web page > > > Key: SPARK-2143 > URL: https://issues.apache.org/jira/browse/SPARK-2143 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Jeff Hammerbacher >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4516) Netty off-heap memory use causes executors to be killed by OS
[ https://issues.apache.org/jira/browse/SPARK-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222172#comment-14222172 ] Patrick Wendell commented on SPARK-4516: Okay then I think this is just a documentation issue. We should add the documentation about direct buffers to the main configuration page and also mention it in the doc about network options. > Netty off-heap memory use causes executors to be killed by OS > - > > Key: SPARK-4516 > URL: https://issues.apache.org/jira/browse/SPARK-4516 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0 > Environment: Linux, Mesos >Reporter: Hector Yee >Priority: Critical > Labels: netty, shuffle > > The netty block transfer manager has a race condition where it closes an > active connection resulting in the error below. Switching to nio seems to > alleviate the problem. > {code} > 14/11/20 18:53:43 INFO TransportClientFactory: Found inactive connection to > i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773, closing it. 
> 14/11/20 18:53:43 ERROR RetryingBlockFetcher: Exception while beginning fetch > of 1 outstanding blocks > java.io.IOException: Failed to connect to > i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:141) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > at > org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:87) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:148) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:288) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:246) > at > com.airbnb.common.ml.training.LinearRankerTrainer$$anonfun$7.apply(LinearRankerTrainer.scala:235) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: > i-974cd879.inst.aws.airbnb.com/10.154.228.43:57773 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --
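For context on the documentation fix discussed above: Netty's transport uses direct (off-heap) buffers, which are bounded by -XX:MaxDirectMemorySize rather than the -Xmx heap cap, so an executor can exceed its OS or container memory budget while its heap looks healthy. A minimal illustration of on-heap vs off-heap buffers:

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // A heap buffer lives inside the Java heap and counts against -Xmx.
        ByteBuffer heap = ByteBuffer.allocate(4096);
        // A direct buffer is native memory outside the heap; it is limited by
        // -XX:MaxDirectMemorySize, which is the pool netty-based shuffle uses.
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);
        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
    }
}
```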
[jira] [Commented] (SPARK-4548) Python broadcast is very slow
[ https://issues.apache.org/jira/browse/SPARK-4548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222154#comment-14222154 ] Apache Spark commented on SPARK-4548:
-------------------------------------

User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3417

> Python broadcast is very slow
> -----------------------------
>
> Key: SPARK-4548
> URL: https://issues.apache.org/jira/browse/SPARK-4548
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.2.0
> Reporter: Davies Liu
>
> Python broadcast in 1.2 is much slower than in 1.1. In spark-perf tests:
>
> name                     1.1    1.2     speedup
> python-broadcast-w-set   3.63   16.68   -78.23%
[jira] [Created] (SPARK-4555) Add forward compatibility tests to JsonProtocol
Josh Rosen created SPARK-4555:
-------------------------------

             Summary: Add forward compatibility tests to JsonProtocol
                 Key: SPARK-4555
                 URL: https://issues.apache.org/jira/browse/SPARK-4555
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Josh Rosen

The web UI / event listener's JsonProtocol is designed to be backwards- and forwards-compatible: newer versions of Spark should be able to consume event logs written by older versions and vice-versa. We currently have backwards-compatibility tests for the "newer version reads older log" case; this JIRA tracks progress for adding the opposite forwards-compatibility tests. This type of test could be non-trivial to write, since I think we'd need to actually run a script against multiple compiled Spark releases, so this test might need to sit outside of Spark Core itself as part of an integration testing suite.
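The forward-compatibility property the issue asks to test can be illustrated with a toy sketch (this is not Spark's actual JsonProtocol; the field names and `parse_event` helper are hypothetical): a reader is forward-compatible if it keeps the fields it knows and ignores fields that only a newer writer emits, so a newer version's event log still parses.

```python
import json

# Hypothetical sketch, not Spark's JsonProtocol: a forward-compatible reader
# keeps known fields and silently ignores fields added by a newer writer.
KNOWN_FIELDS = {"Event", "Timestamp"}

def parse_event(line):
    record = json.loads(line)
    # Drop unknown keys instead of failing on them.
    return {k: v for k, v in record.items() if k in KNOWN_FIELDS}

# A log line as a hypothetical newer Spark version might write it:
newer_line = '{"Event": "SparkListenerTaskEnd", "Timestamp": 1416700000, "New Field": true}'
event = parse_event(newer_line)  # "New Field" is ignored, parsing succeeds
```

A real test along these lines would generate logs with several compiled Spark releases and assert that each older reader accepts each newer log, which is why the issue suggests an external integration suite.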
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222112#comment-14222112 ] Evan Sparks commented on SPARK-1405:
------------------------------------

Bucket has been created: s3://files.sparks.requester.pays/enwiki_category_text/

All in all there are 181 ~50MB files (closer to 10GB in total). It probably makes sense to use http://sweble.org/ or something similar to strip the boilerplate, etc. from the documents for the purposes of topic modeling.

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -----------------------------------------------------------------
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xusen Yin
> Assignee: Guoqiang Li
> Priority: Critical
> Labels: features
> Attachments: performance_comparison.png
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling.
>
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core.
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 6:40 PM:
---------------------------------------------------------------

[~sparks] that will be awesome...I should be fine running experiments on EC2...

was (Author: debasish83):
@sparks that will be awesome...I should be fine running experiments on EC2...
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222108#comment-14222108 ] Debasish Das commented on SPARK-1405:
--------------------------------------

@sparks that will be awesome...I should be fine running experiments on EC2...
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222105#comment-14222105 ] Evan Sparks commented on SPARK-1405:
------------------------------------

[~gq] - Those are great numbers for a very high number of topics - it's a little tough to follow what's leading to the super-linear scaling in #topics in your code, though. Are you using FastLDA or something similar to speed up sampling? (http://www.ics.uci.edu/~newman/pubs/fastlda.pdf)

Pedro has been testing on a wikipedia dump on s3 which I provided. It's XML formatted, one document per line, so it's easy to parse. I will copy this to a requester-pays bucket (which will be free if you run your experiments on EC2) now so that everyone working on this can use it for testing.

The NIPS dataset seems fine for small-scale testing, but I think it's important that we test this implementation across a range of values for documents, words, topics, and tokens - hence, I think the data generator that Pedro is working on is a really good idea (and it follows the convention of the existing data generators in MLlib). We'll have to be a little careful here, because some of the methods for making LDA fast rely on the fact that it tends to converge quickly, and I expect that data generated by the model will be much easier to fit than real data.

Also, can we try to be consistent in our terminology - it is easy to confuse the number of unique words with the total number of words in a corpus. I propose "words" and "tokens" for these two things.
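The words-vs-tokens distinction Evan proposes is easy to pin down in a few lines (toy corpus for illustration, not the wiki dump): "tokens" counts every occurrence, "words" counts distinct vocabulary entries.

```python
# "tokens" = total word occurrences in the corpus;
# "words"  = distinct vocabulary entries.
corpus = ["lda extracts topics", "topics from text", "text text text"]

tokens = [t for doc in corpus for t in doc.split()]
words = set(tokens)

num_tokens = len(tokens)  # 9 occurrences in total
num_words = len(words)    # 5 distinct words: lda, extracts, topics, from, text
```

Scaling experiments typically grow in tokens (corpus length) while the words axis (vocabulary size) grows much more slowly, which is why the two must not be conflated when reporting benchmark sizes.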
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222089#comment-14222089 ] Debasish Das commented on SPARK-1405:
--------------------------------------

The NIPS dataset is common for PLSA and additive-regularization-based matrix factorization formulations as well, since the experiments in this paper also focused on the NIPS dataset: http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf

I will be using the NIPS dataset for quality experiments, but for scaling experiments the wiki data is good...the wiki data was demoed by Databricks at the last Spark Summit...it will be great if we can get it from that demo.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222048#comment-14222048 ] Guoqiang Li commented on SPARK-1405:
------------------------------------

Sorry, I meant the wikipedia data download URL. How much text do we need? I think one billion words is appropriate.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222036#comment-14222036 ] Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

Not sure which download URL you are referring to?
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222033#comment-14222033 ] Guoqiang Li commented on SPARK-1405:
------------------------------------

OK, where is the download URL?
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222030#comment-14222030 ] Pedro Rodriguez commented on SPARK-1405:
----------------------------------------

I don't know of a larger data set, but I am working on an LDA data set generator based on the generative model. It should be good for benchmark testing but still be reasonable from the ML perspective.

The metric is in the LDA code (which is turned on and off with a flag on the LDA model). You can find it here in the logLikelihood function: https://github.com/EntilZha/spark/blob/LDA/graphx/src/main/scala/org/apache/spark/graphx/lib/LDA.scala
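A generator of the kind Pedro describes follows LDA's generative model directly: draw per-topic word distributions from a Dirichlet, then for each document draw a topic mixture, and for each token draw a topic and then a word. The sketch below is a hedged standalone illustration of that process, not the actual generator being contributed; all names and parameter defaults (alpha, beta, etc.) are illustrative.

```python
import random

# Hedged sketch of a synthetic-corpus generator following LDA's generative
# model. Parameter names and defaults are illustrative, not MLlib's API.

def dirichlet(concentration, k, rng):
    # Sample a symmetric Dirichlet by normalizing k independent Gamma draws.
    g = [rng.gammavariate(concentration, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

def sample_discrete(probs, rng):
    # Draw an index from a normalized discrete distribution.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

def generate_corpus(num_docs, doc_len, num_topics, vocab_size,
                    alpha=0.1, beta=0.01, seed=42):
    rng = random.Random(seed)
    # Per-topic word distributions: phi_k ~ Dirichlet(beta)
    phi = [dirichlet(beta, vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        theta = dirichlet(alpha, num_topics, rng)  # document's topic mixture
        doc = []
        for _ in range(doc_len):
            z = sample_discrete(theta, rng)        # pick a topic
            doc.append(sample_discrete(phi[z], rng))  # pick a word id
        corpus.append(doc)
    return corpus
```

As Evan cautions above, corpora drawn exactly from the model tend to be easier to fit than real text, so a generator like this is best treated as a scaling benchmark rather than a quality benchmark.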
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222027#comment-14222027 ] Debasish Das commented on SPARK-1405:
--------------------------------------

[~pedrorodriguez] did you write the metric in your repo as well? That way I don't have to code it up again..
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024 ] Debasish Das edited comment on SPARK-1405 at 11/22/14 4:22 PM:
---------------------------------------------------------------

We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each step...NIPS dataset is small...Let's start with that...But we should test a large dataset like wikipedia as well..If there is a pre-processed version from either mahout or scikit-learn we can use that?

was (Author: debasish83):
We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each step...NIPS dataset is small...you guys will be willing to test a wikipedia dataset for example? If there is a pre-processed version from either mahout or scikit-learn we can use that?
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222024#comment-14222024 ] Debasish Das commented on SPARK-1405:
--------------------------------------

We need a larger dataset as well where topics go to the range of 1+...That range will stress factorization based LSA formulations since there is broadcast of factors at each step...NIPS dataset is small...you guys will be willing to test a wikipedia dataset for example? If there is a pre-processed version from either mahout or scikit-learn we can use that?
[jira] [Commented] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13
[ https://issues.apache.org/jira/browse/SPARK-4554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221950#comment-14221950 ] Apache Spark commented on SPARK-4554:
-------------------------------------

User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3416

> Set fair scheduler pool for JDBC client session in hive 13
> ----------------------------------------------------------
>
> Key: SPARK-4554
> URL: https://issues.apache.org/jira/browse/SPARK-4554
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: wangfei
> Fix For: 1.2.0
>
> The hive 13 shim currently does not support setting the fair scheduler pool.
[jira] [Created] (SPARK-4554) Set fair scheduler pool for JDBC client session in hive 13
wangfei created SPARK-4554:
----------------------------

             Summary: Set fair scheduler pool for JDBC client session in hive 13
                 Key: SPARK-4554
                 URL: https://issues.apache.org/jira/browse/SPARK-4554
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: wangfei
             Fix For: 1.2.0

The hive 13 shim currently does not support setting the fair scheduler pool.
[jira] [Commented] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221895#comment-14221895 ] Kai Sasaki commented on SPARK-4288:
-----------------------------------

[~mengxr] Thank you. I'll join.

> Add Sparse Autoencoder algorithm to MLlib
> -----------------------------------------
>
> Key: SPARK-4288
> URL: https://issues.apache.org/jira/browse/SPARK-4288
> Project: Spark
> Issue Type: Wish
> Components: MLlib
> Reporter: Guoqiang Li
> Labels: features
>
> Are you proposing an implementation? Is it related to the neural network JIRA?
[jira] [Commented] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
[ https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221892#comment-14221892 ] Apache Spark commented on SPARK-4553:
-------------------------------------

User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3414

> query for parquet table with string fields in spark sql hive get binary result
> ------------------------------------------------------------------------------
>
> Key: SPARK-4553
> URL: https://issues.apache.org/jira/browse/SPARK-4553
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: wangfei
> Fix For: 1.2.0
>
> Run:
>
> create table test_parquet(key int, value string) stored as parquet;
> insert into table test_parquet select * from src;
> select * from test_parquet;
>
> and the result looks like:
> ...
> 282 [B@38fda3b
> 138 [B@1407a24
> 238 [B@12de6fb
> 419 [B@6c97695
> 15 [B@4885067
> 118 [B@156a8d3
> 72 [B@65d20dd
> 90 [B@4c18906
> 307 [B@60b24cc
> 19 [B@59cf51b
> 435 [B@39fdf37
> 10 [B@4f799d7
> 277 [B@3950951
> 273 [B@596bf4b
> 306 [B@3e91557
> 224 [B@3781d61
> 309 [B@2d0d128
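The `[B@38fda3b`-style output is the JVM's default toString for a byte array: Parquet stores string columns as UTF-8-encoded binary, and the bug is that the raw byte arrays are surfaced without being decoded. A minimal Python illustration of the decode step (illustrative only, not the Scala fix in the PR):

```python
# Parquet string columns arrive as UTF-8 bytes; printing the raw array gives
# an opaque identifier (the JVM renders it as "[B@38fda3b"). Decoding the
# bytes restores the original string value. Sample values are illustrative.
raw_values = [b"val_282", b"val_138"]                # what the reader returns
decoded = [v.decode("utf-8") for v in raw_values]    # what the query should show
```

The actual fix has to apply the equivalent conversion (Binary to String) inside Spark SQL's Parquet-backed Hive table scan.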
[jira] [Commented] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-4552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221889#comment-14221889 ] Apache Spark commented on SPARK-4552:
-------------------------------------

User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/3413

> query for empty parquet table in spark sql hive get IllegalArgumentException
> ----------------------------------------------------------------------------
>
> Key: SPARK-4552
> URL: https://issues.apache.org/jira/browse/SPARK-4552
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: wangfei
> Fix For: 1.2.0
>
> Run:
>
> create table test_parquet(key int, value string) stored as parquet;
> select * from test_parquet;
>
> and get an error like:
>
> java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet
> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc
[jira] [Created] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
wangfei created SPARK-4553:
----------------------------

             Summary: query for parquet table with string fields in spark sql hive get binary result
                 Key: SPARK-4553
                 URL: https://issues.apache.org/jira/browse/SPARK-4553
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: wangfei
             Fix For: 1.2.0

Run:

create table test_parquet(key int, value string) stored as parquet;
insert into table test_parquet select * from src;
select * from test_parquet;

and the result looks like:
...
282 [B@38fda3b
138 [B@1407a24
238 [B@12de6fb
419 [B@6c97695
15 [B@4885067
118 [B@156a8d3
72 [B@65d20dd
90 [B@4c18906
307 [B@60b24cc
19 [B@59cf51b
435 [B@39fdf37
10 [B@4f799d7
277 [B@3950951
273 [B@596bf4b
306 [B@3e91557
224 [B@3781d61
309 [B@2d0d128
[jira] [Created] (SPARK-4552) query for empty parquet table in spark sql hive get IllegalArgumentException
wangfei created SPARK-4552:
----------------------------

             Summary: query for empty parquet table in spark sql hive get IllegalArgumentException
                 Key: SPARK-4552
                 URL: https://issues.apache.org/jira/browse/SPARK-4552
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: wangfei
             Fix For: 1.2.0

Run:

create table test_parquet(key int, value string) stored as parquet;
select * from test_parquet;

and get an error like:

java.lang.IllegalArgumentException: Could not find Parquet metadata at path file:/user/hive/warehouse/test_parquet
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$readMetaData$4.apply(ParquetTypes.scala:459)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.parquet.ParquetTypesConverter$.readMetaData(ParquetTypes.sc
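The `scala.Option.getOrElse` frame in the trace shows the failure mode: a freshly created table has no data files yet, so no Parquet footer exists and the `getOrElse` fallback throws. A tolerant reader can instead treat "no footer" as an empty schema. The sketch below is a hedged Python illustration of that guard-clause shape; `read_metadata` and its return values are hypothetical, not Spark's actual internals.

```python
# Hedged sketch of the failure mode and a tolerant alternative. A new Parquet
# table has written no files, so the footer list is empty; falling through a
# getOrElse-style lookup raises IllegalArgumentException. Returning an empty
# schema instead lets "SELECT * FROM empty_table" yield zero rows.
def read_metadata(footers):
    if not footers:       # empty table: no data files written yet
        return []         # empty schema instead of raising
    return footers[0]     # otherwise use the first footer's schema
```

Whether to return an empty schema or fetch the schema from the Hive metastore instead is a design choice for the actual fix; the point is only that an empty table should not be an error.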