[jira] [Assigned] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly
[ https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7466: --- Assignee: Andrew Or (was: Apache Spark) DAG visualization: orphaned nodes are not rendered correctly Key: SPARK-7466 URL: https://issues.apache.org/jira/browse/SPARK-7466 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Attachments: after.png, before.png If you have an RDD instantiated outside of a scope, it is rendered as a weird badge outside of a stage. This is because we keep the edge but do not inform dagre-d3 of the node, resulting in the library rendering the node for us without the expected styles and labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly
[ https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7466: --- Assignee: Apache Spark (was: Andrew Or) DAG visualization: orphaned nodes are not rendered correctly Key: SPARK-7466 URL: https://issues.apache.org/jira/browse/SPARK-7466 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Apache Spark Priority: Critical Attachments: after.png, before.png If you have an RDD instantiated outside of a scope, it is rendered as a weird badge outside of a stage. This is because we keep the edge but do not inform dagre-d3 of the node, resulting in the library rendering the node for us without the expected styles and labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7466) DAG visualization: orphaned nodes are not rendered correctly
[ https://issues.apache.org/jira/browse/SPARK-7466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534013#comment-14534013 ] Apache Spark commented on SPARK-7466: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6002 DAG visualization: orphaned nodes are not rendered correctly Key: SPARK-7466 URL: https://issues.apache.org/jira/browse/SPARK-7466 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Attachments: after.png, before.png If you have an RDD instantiated outside of a scope, it is rendered as a weird badge outside of a stage. This is because we keep the edge but do not inform dagre-d3 of the node, resulting in the library rendering the node for us without the expected styles and labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534078#comment-14534078 ] Tathagata Das commented on SPARK-6770: -- Was this problem solved? I think I discuss this explicitly in the Streaming guide here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations If this solves the issue, I am inclined to close this JIRA. Either way, this is not a problem with DirectKafkaInputDStream as the JIRA title seems to indicate. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I read data from Kafka using the createDirectStream method and save the received logs to MySQL; the code snippet is as follows:
{code}
def functionToCreateContext(): StreamingContext = {
  val sparkConf = new SparkConf()
  val sc = new SparkContext(sparkConf)
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory
  ssc
}
val struct = StructType(StructField("log", StringType) :: Nil)
// Get StreamingContext from checkpoint data or create a new one
val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext)
val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext)
SDB.foreachRDD(rdd => {
  val result = rdd.map(item => {
    println(item)
    val result = item._2 match {
      case e: String => Row.apply(e)
      case _ => Row.apply()
    }
    result
  })
  println(result.count())
  val df = sqlContext.createDataFrame(result, struct)
  df.insertIntoJDBC(url, "test", overwrite = false)
})
ssc.start()
ssc.awaitTermination()
ssc.stop()
{code}
But when I recover the program from the checkpoint, I encounter an exception:
{code}
Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized
at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266)
at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218)
at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512)
at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57)
at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
{code}
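For reference, a minimal sketch of the pattern the streaming-guide section linked above describes, under the same assumptions as the snippet in the report (kafkaParams, topics and a JDBC url defined elsewhere): all DStream setup moves inside the function passed to StreamingContext.getOrCreate so the checkpointed graph can be restored intact, and the SQLContext comes from a lazily created singleton rather than being captured from the driver scope. This is an illustration of the approach, not the reporter's exact code.
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Lazily instantiated singleton SQLContext, as in the SqlNetworkWordCount example.
object SQLContextSingleton {
  @transient private var instance: SQLContext = _
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) instance = new SQLContext(sc)
    instance
  }
}

// kafkaParams, topics and url are placeholders, as in the original snippet.
def createContext(kafkaParams: Map[String, String], topics: Set[String], url: String): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset")
  val struct = StructType(StructField("log", StringType) :: Nil)
  // The direct stream is created *inside* this function, so it is part of the
  // DStream graph that gets checkpointed and later restored.
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    val rows = rdd.map { case (_, log) => Row(log) }
    sqlContext.createDataFrame(rows, struct).insertIntoJDBC(url, "test", false)
  }
  ssc
}

// Recovery: getOrCreate either restores the full graph from the checkpoint
// or builds a fresh one with createContext.
// val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset",
//   () => createContext(kafkaParams, topics, url))
// ssc.start(); ssc.awaitTermination()
{code}
With this structure, getOrCreate deserializes the whole DStream graph from the checkpoint on restart, so the direct Kafka stream is initialized before any jobs are generated.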
[jira] [Created] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
Steve Loughran created SPARK-7481: - Summary: Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right and to add s3a, Swift and Azure support, the Spark dependencies in a Hadoop 2.6+ profile need to include the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure). This adds more to the client bundle, but it means a single Spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6091) Add MulticlassMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6091: - Target Version/s: 1.4.0 Add MulticlassMetrics in PySpark/MLlib -- Key: SPARK-6091 URL: https://issues.apache.org/jira/browse/SPARK-6091 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6091) Add MulticlassMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6091: - Assignee: Yanbo Liang Add MulticlassMetrics in PySpark/MLlib -- Key: SPARK-6091 URL: https://issues.apache.org/jira/browse/SPARK-6091 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6092) Add RankingMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6092: - Assignee: Yanbo Liang Add RankingMetrics in PySpark/MLlib --- Key: SPARK-6092 URL: https://issues.apache.org/jira/browse/SPARK-6092 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6092) Add RankingMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6092: - Target Version/s: 1.4.0 Add RankingMetrics in PySpark/MLlib --- Key: SPARK-6092 URL: https://issues.apache.org/jira/browse/SPARK-6092 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Yanbo Liang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
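As a rough sketch of the intended semantics (an assumption about what such a helper could look like, not the actual implementation proposed in the pull request for this ticket), a getOrCreate of this kind can be as small as a lazily initialized holder keyed off a SparkContext:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Hypothetical sketch: return the existing singleton SQLContext if one has
// been created, otherwise create it from the given SparkContext.
object SQLContextHolder {
  @volatile private var instance: SQLContext = _

  def getOrCreate(sc: SparkContext): SQLContext = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          instance = new SQLContext(sc)
        }
      }
    }
    instance
  }
}

// Re-running this line in a REPL, or calling it from a job recovered from a
// DStream checkpoint, always yields the same instance, so registered temp
// tables are not lost:
// val sqlContext = SQLContextHolder.getOrCreate(sc)
{code}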
[jira] [Assigned] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7478: --- Assignee: Tathagata Das (was: Apache Spark) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6889. -- Resolution: Fixed Fix Version/s: 1.4.0 Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Fix For: 1.4.0 Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf, faq.html.patch From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534099#comment-14534099 ] Apache Spark commented on SPARK-7478: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/6006 Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7479) SparkR can not work
Weizhong created SPARK-7479: --- Summary: SparkR can not work Key: SPARK-7479 URL: https://issues.apache.org/jira/browse/SPARK-7479 Project: Spark Issue Type: Bug Components: SparkR Reporter: Weizhong Priority: Minor I have built Spark from the master branch and run SparkR, but it failed when I ran pi.R. The error is: Error: could not find function "parallelize" But if I qualify the function with its namespace, for example SparkR:::parallelize, then it works correctly. My cluster info: JDK: 1.8.0_40 Hadoop: 2.6.0 R: 3.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7479) SparkR can not work
[ https://issues.apache.org/jira/browse/SPARK-7479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7479. -- Resolution: Invalid This kind of thing should begin as a question at user@ as I suspect it is a basic problem with your env. SparkR can not work --- Key: SPARK-7479 URL: https://issues.apache.org/jira/browse/SPARK-7479 Project: Spark Issue Type: Bug Components: SparkR Reporter: Weizhong Priority: Minor I have built Spark from the master branch and run SparkR, but it failed when I ran pi.R. The error is: Error: could not find function "parallelize" But if I qualify the function with its namespace, for example SparkR:::parallelize, then it works correctly. My cluster info: JDK: 1.8.0_40 Hadoop: 2.6.0 R: 3.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534168#comment-14534168 ] Sean Owen commented on SPARK-7481: -- Yikes, that seems like a load of stuff to pull in. Can't this / shouldn't this be added by the end user if desired? Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770](https://issues.apache.org/jira/browse/SPARK-6770) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5034) Spark on Yarn launch failure on HDInsight on Windows
[ https://issues.apache.org/jira/browse/SPARK-5034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5034. -- Resolution: Cannot Reproduce I don't know what to make of this without more info. I don't think it is a parsing or quoting issue as it just looks like the main class name is incorrect and overwritten in part by some host name. Spark on Yarn launch failure on HDInsight on Windows Key: SPARK-5034 URL: https://issues.apache.org/jira/browse/SPARK-5034 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.0, 1.1.1, 1.2.0 Environment: Spark on Yarn within HDInsight on Windows Azure Reporter: Rice Windows Environment I'm trying to run JavaSparkPi example on YARN with master = yarn-client but I have a problem. It runs smoothly with submitting application, first container for Application Master works too. When job is starting and there are some tasks to do I'm getting this warning on console (I'm using windows cmd if this makes any difference): WARN cluster.YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory When I'm checking logs for container with Application Masters it is launching containers for executors properly, then goes with: INFO YarnAllocationHandler: Completed container container_1409217202587_0003_01_02 (state: COMPLETE, exit status: 1) INFO YarnAllocationHandler: Container marked as failed: container_1409217202587_0003_01_02 And tries to re-launch them. On failed container log there is only this: Error: Could not find or load main class pwd..sp...@gbv06758291.my.secret.address.net:63680.user.CoarseGrainedScheduler -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770](https://issues.apache.org/jira/browse/SPARK-6770) was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770](https://issues.apache.org/jira/browse/SPARK-6770) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {val sqlContext = new SQLContext} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {SQLContext.getOrCreate} Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7478: --- Assignee: Apache Spark (was: Tathagata Das) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Apache Spark Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534026#comment-14534026 ] Apache Spark commented on SPARK-6876: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/6003 DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
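For context, a brief Scala sketch of the existing DataFrame.na.replace API that the Python version would mirror; the DataFrame and its column names and values below are hypothetical examples, not anything defined in this ticket.
{code}
import org.apache.spark.sql.DataFrame

// Replace sentinel values column-by-column, similar to pandas' DataFrame.replace.
// `df` and its columns ("name", "height", "weight") are made up for illustration.
def cleanUp(df: DataFrame): DataFrame = {
  val named = df.na.replace("name", Map("UNKNOWN" -> "unnamed"))
  named.na.replace(Seq("height", "weight"), Map(-1.0 -> Double.NaN))
}
{code}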
[jira] [Assigned] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7467: --- Assignee: Apache Spark (was: Andrew Or) DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Apache Spark We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7467: --- Assignee: Andrew Or (was: Apache Spark) DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7467) DAG visualization: handle checkpoint correctly
[ https://issues.apache.org/jira/browse/SPARK-7467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534032#comment-14534032 ] Apache Spark commented on SPARK-7467: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/6004 DAG visualization: handle checkpoint correctly -- Key: SPARK-7467 URL: https://issues.apache.org/jira/browse/SPARK-7467 Project: Spark Issue Type: Sub-task Components: Web UI Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or We need to wrap RDD#doCheckpoint in a scope. Otherwise CheckpointRDDs may belong to other operators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534094#comment-14534094 ] Tathagata Das edited comment on SPARK-7478 at 5/8/15 8:24 AM: -- [~rxin] [~marmbrus] Thoughts? was (Author: tdas): [~rxin][~marmbrus] Thoughts? Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534094#comment-14534094 ] Tathagata Das edited comment on SPARK-7478 at 5/8/15 8:24 AM: -- [~rxin][~marmbrus] Thoughts? was (Author: tdas): [~rxin] Thoughts? Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534094#comment-14534094 ] Tathagata Das commented on SPARK-7478: -- [~rxin] Thoughts? Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6876: --- Assignee: Apache Spark DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6876) DataFrame.na.replace value support for Python
[ https://issues.apache.org/jira/browse/SPARK-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6876: --- Assignee: (was: Apache Spark) DataFrame.na.replace value support for Python - Key: SPARK-6876 URL: https://issues.apache.org/jira/browse/SPARK-6876 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Scala/Java support is in. We should provide the Python version, similar to what Pandas supports. http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly
[ https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7231: --- Assignee: Apache Spark (was: Shivaram Venkataraman) Make SparkR DataFrame API more dplyr friendly - Key: SPARK-7231 URL: https://issues.apache.org/jira/browse/SPARK-7231 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Apache Spark Priority: Critical This ticket tracks auditing the SparkR dataframe API and ensuring that the API is friendly to existing R users. Mainly we wish to make sure the DataFrame API we expose has functions similar to those which exist on native R data frames and in popular packages like `dplyr`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7231) Make SparkR DataFrame API more dplyr friendly
[ https://issues.apache.org/jira/browse/SPARK-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534034#comment-14534034 ] Apache Spark commented on SPARK-7231: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/6005 Make SparkR DataFrame API more dplyr friendly - Key: SPARK-7231 URL: https://issues.apache.org/jira/browse/SPARK-7231 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.4.0 Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Priority: Critical This ticket tracks auditing the SparkR dataframe API and ensuring that the API is friendly to existing R users. Mainly we wish to make sure the DataFrame API we expose has functions similar to those which exist on native R data frames and in popular packages like `dplyr`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1423) Add scripts for launching Spark on Windows Azure
[ https://issues.apache.org/jira/browse/SPARK-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1423. -- Resolution: Won't Fix Given the lack of activity and resolution of https://issues.apache.org/jira/browse/SPARK-1422 I think that's probably correct. Add scripts for launching Spark on Windows Azure Key: SPARK-1423 URL: https://issues.apache.org/jira/browse/SPARK-1423 Project: Spark Issue Type: Improvement Components: Windows Reporter: Matei Zaharia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7392) Kryo buffer size can not be larger than 2M
[ https://issues.apache.org/jira/browse/SPARK-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7392. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Zhang, Liye Resolved by https://github.com/apache/spark/pull/5934 Kryo buffer size can not be larger than 2M -- Key: SPARK-7392 URL: https://issues.apache.org/jira/browse/SPARK-7392 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Zhang, Liye Assignee: Zhang, Liye Priority: Critical Fix For: 1.4.0 When *spark.kryoserializer.buffer* is set larger than 2048k, an *IllegalArgumentException* is thrown. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
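For illustration, the configuration involved is the Kryo buffer size; with this fix a value above the old 2048k ceiling is accepted (the concrete value below is an arbitrary example, not from the ticket):
{code}
import org.apache.spark.SparkConf

// Before the fix, any Kryo buffer larger than 2048k threw an
// IllegalArgumentException; "4m" here is just an example value above that limit.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "4m")
{code}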
[jira] [Created] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
Tathagata Das created SPARK-7478: Summary: Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line val sqlContext = new SQLContext multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7478) Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-7478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-7478: - Description: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf was: Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf Add a SQLContext.getOrCreate to maintain a singleton instance of SQLContext --- Key: SPARK-7478 URL: https://issues.apache.org/jira/browse/SPARK-7478 Project: Spark Issue Type: New Feature Components: SQL Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker Having a SQLContext singleton would make it easier for applications to use a lazily instantiated single shared instance of SQLContext when needed. It would avoid problems like 1. In REPL/notebook environment, rerunning the line {{val sqlContext = new SQLContext}} multiple times created different contexts while overriding the reference to previous context, leading to issues like registered temp tables going missing. 2. In Streaming, creating SQLContext directly leads to serialization/deserialization issues when attempting to recover from DStream checkpoints. See [SPARK-6770]. Also to get around this problem I had to suggest creating a singleton instance - https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/SqlNetworkWordCount.scala This can be solved by {{SQLContext.getOrCreate}} which get or creates a new singleton instance of SQLContext using either a given SparkContext or a given SparkConf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7480) Get exception when DataFrame saveAsTable and run sql on the same table at the same time
pin_zhang created SPARK-7480: Summary: Get exception when DataFrame saveAsTable and run sql on the same table at the same time Key: SPARK-7480 URL: https://issues.apache.org/jira/browse/SPARK-7480 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1, 1.3.0 Reporter: pin_zhang There is a case: 1) In the main thread, call DataFrame.saveAsTable(table, SaveMode.Overwrite) to save a JSON RDD to a Hive table. 2) In another thread, run SQL against the same table simultaneously. You can see many exceptions indicating that the table does not exist or is not complete. Does Spark SQL support such usage? Thanks
{code}
[Main Thread]
DataFrame df = hiveContext_.jsonFile("test.json");
String table = "UNIT_TEST";
while (true) {
  df = hiveContext_.jsonFile("test.json");
  df.saveAsTable(table, SaveMode.Overwrite);
  System.out.println(new Timestamp(System.currentTimeMillis()) + " [" + Thread.currentThread().getName() + "] override table");
  try {
    Thread.sleep(3000);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
}

[Query Thread]
DataFrame query = hiveContext_.sql("select * from UNIT_TEST");
Row[] rows = query.collect();
System.out.println(new Timestamp(System.currentTimeMillis()) + " [" + Thread.currentThread().getName() + "] [query result count:] " + rows.length);
{code}
[Exceptions in log]
{code}
15/05/08 16:05:49 ERROR Hive: NoSuchObjectException(message:default.unit_test table not found)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_table(HiveMetaStore.java:1560)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at com.sun.proxy.$Proxy20.get_table(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997)
at sun.reflect.GeneratedMethodAccessor23.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy21.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:950)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:201)
at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:262)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:161)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:262)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:174)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:186)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$6.applyOrElse(Analyzer.scala:181)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:188)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:208)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
{code}
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534174#comment-14534174 ] Steve Loughran commented on SPARK-7481: --- This doesn't contain any endorsement of the use of s3a in Hadoop 2.6; see HADOOP-11571. I'm not planning to add any tests for this, but it's something to consider for regression testing all the object stores; the tests just need to: * be skipped if there are no credentials * make a best effort to stop anyone accidentally checking in their credentials * work on desktop/jenkins rather than just on cloud. * not run up massive bills * not take forever (see the sketch after this message) AWS publishes some free-to-read datasets, such as [this one|http://datasets.elasticmapreduce.s3.amazonaws.com/], which doesn't need credentials, works remotely, and doesn't ring up bills for the read part of the process, but would take a long time to complete on a single executor. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
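A hedged sketch of what such a credential-gated, read-only object-store smoke test could look like. The OBJECT_STORE_TEST_PATH environment variable and the bounded take(10) are illustrative choices, not anything Spark or Hadoop defines; the idea is simply to skip when nothing is configured and to keep the read small and cheap.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ObjectStoreSmokeTest {
  def main(args: Array[String]): Unit = {
    // Skip the test entirely when no object-store path (and hence no credentials
    // or endpoint) has been configured for this environment.
    val path = sys.env.getOrElse("OBJECT_STORE_TEST_PATH", "")
    if (path.isEmpty) {
      println("No object store test path configured; skipping.")
    } else {
      val sc = new SparkContext(new SparkConf().setAppName("ObjectStoreSmokeTest"))
      try {
        // A bounded read: only look at a handful of lines rather than the whole
        // dataset, so the test neither takes forever nor runs up a bill.
        val firstLines = sc.textFile(path).take(10)
        println(s"Read ${firstLines.length} lines from $path")
      } finally {
        sc.stop()
      }
    }
  }
}
{code}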
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534209#comment-14534209 ] yangping wu commented on SPARK-6770: Hi [~tdas], I use the code you mentioned, It was successes recovery from checkpoint. It solves the issue, Thank you. DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am read data from kafka using createDirectStream method and save the received log to Mysql, the code snippets as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint(/tmp/kafka/channel/offset) // set checkpoint directory ssc } val struct = StructType(StructField(log, StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(/tmp/kafka/channel/offset, functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd = { val result = rdd.map(item = { println(item) val result = item._2 match { case e: String = Row.apply(e) case _ = Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, test, overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recovery the program from checkpoint, I encountered an exception: {code} Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534210#comment-14534210 ] Sean Owen commented on SPARK-7481: -- Maybe I'd be less frightened if I knew the size of these deps and their dependencies was small, and the licenses were all OK, etc. This would need some checking; I know we had a license problem and so forth with Kinesis, and have had jets3t problems, etc. I am maybe needlessly wary of doing this several times over to add more niche FS clients to the main build for everyone. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534240#comment-14534240 ] Tathagata Das commented on SPARK-6770: -- Awesome! I am closing this JIRA then! DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am read data from kafka using createDirectStream method and save the received log to Mysql, the code snippets as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint(/tmp/kafka/channel/offset) // set checkpoint directory ssc } val struct = StructType(StructField(log, StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(/tmp/kafka/channel/offset, functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd = { val result = rdd.map(item = { println(item) val result = item._2 match { case e: String = Row.apply(e) case _ = Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, test, overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recovery the program from checkpoint, I encountered an exception: {code} Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at 
org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at
[jira] [Commented] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534276#comment-14534276 ] Octavian Geagla commented on SPARK-7459: Can do! Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
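For context, a rough Scala sketch of the ElementwiseProduct usage that the requested Java example would mirror. This assumes the org.apache.spark.mllib.feature.ElementwiseProduct transformer; the vectors are illustrative, and the official programming guide example should be taken as authoritative.

{code}
import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

// Scale each element of an input vector by the corresponding element of the scaling vector.
val scalingVec = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct(scalingVec)

val transformed = transformer.transform(Vectors.dense(1.0, 2.0, 3.0))
println(transformed)  // expected: [0.0,2.0,6.0]
{code}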
[jira] [Closed] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-6770. Resolution: Not A Problem DirectKafkaInputDStream has not been initialized when recovery from checkpoint -- Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am read data from kafka using createDirectStream method and save the received log to Mysql, the code snippets as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint(/tmp/kafka/channel/offset) // set checkpoint directory ssc } val struct = StructType(StructField(log, StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate(/tmp/kafka/channel/offset, functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd = { val result = rdd.map(item = { println(item) val result = item._2 match { case e: String = Row.apply(e) case _ = Row.apply() } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, test, overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recovery the program from checkpoint, I encountered an exception: {code} Exception in thread main org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at 
org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at
[jira] [Assigned] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
[ https://issues.apache.org/jira/browse/SPARK-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7482: --- Assignee: (was: Apache Spark) Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
[ https://issues.apache.org/jira/browse/SPARK-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534255#comment-14534255 ] Apache Spark commented on SPARK-7482: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/6007 Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
[ https://issues.apache.org/jira/browse/SPARK-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7482: --- Assignee: Apache Spark Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Assignee: Apache Spark Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534276#comment-14534276 ] Octavian Geagla edited comment on SPARK-7459 at 5/8/15 10:26 AM: - Can do! Please assign to me. was (Author: ogeagla): Can do! Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534303#comment-14534303 ] Sean Owen commented on SPARK-7459: -- You don't need to be assigned; just go ahead. Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6154) Support Kafka, JDBC in Scala 2.11
[ https://issues.apache.org/jira/browse/SPARK-6154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534344#comment-14534344 ] Jianshi Huang commented on SPARK-6154: -- Do you mean we need to upgrade the jline version for both 2.11 and 2.10? Jianshi Support Kafka, JDBC in Scala 2.11 - Key: SPARK-6154 URL: https://issues.apache.org/jira/browse/SPARK-6154 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.0 Reporter: Jianshi Huang Build v1.3.0-rc2 with Scala 2.11 using instructions in the documentation failed when -Phive-thriftserver is enabled. [info] Compiling 9 Scala sources to /home/hjs/workspace/spark/sql/hive-thriftserver/target/scala-2.11/classes... [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:2 5: object ConsoleReader is not a member of package jline [error] import jline.{ConsoleReader, History} [error]^ [warn] Class jline.Completor not found - continuing with a stub. [warn] Class jline.ConsoleReader not found - continuing with a stub. [error] /home/hjs/workspace/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:1 65: not found: type ConsoleReader [error] val reader = new ConsoleReader() Jianshi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7482) Rename some DataFrame API methods in SparkR to match their counterparts in Scala
Sun Rui created SPARK-7482: -- Summary: Rename some DataFrame API methods in SparkR to match their counterparts in Scala Key: SPARK-7482 URL: https://issues.apache.org/jira/browse/SPARK-7482 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical This is a re-consideration of how to solve name conflicts. Previously, we renamed API names from the Scala API if there was a name conflict with base or other commonly-used packages. However, from a long-term perspective, this is not good for API stability, because we can't predict name conflicts; for example, what if in the future a name added to the base package conflicts with an API in SparkR? So the better policy is to keep API names the same as Scala's without worrying about name conflicts. When users use SparkR, they should load SparkR as the last package, so that all API names are effective. Users can explicitly use :: to refer to hidden names from other packages. More discussion can be found at https://issues.apache.org/jira/browse/SPARK-6812 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7459) Add Java example for ElementwiseProduct in programming guide
[ https://issues.apache.org/jira/browse/SPARK-7459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7459: - Assignee: Octavian Geagla Add Java example for ElementwiseProduct in programming guide Key: SPARK-7459 URL: https://issues.apache.org/jira/browse/SPARK-7459 Project: Spark Issue Type: Documentation Components: Documentation, Java API, ML Reporter: Joseph K. Bradley Assignee: Octavian Geagla Priority: Minor Duplicate Scala example, but in Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Priority: Blocker (was: Minor) Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Blocker From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7449) createPhysicalRDD should use RDD output as schema instead of relation.schema
[ https://issues.apache.org/jira/browse/SPARK-7449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7449: - Component/s: SQL createPhysicalRDD should use RDD output as schema instead of relation.schema Key: SPARK-7449 URL: https://issues.apache.org/jira/browse/SPARK-7449 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception
[ https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7483: - Component/s: MLlib Priority: Minor (was: Major) [MLLib] Using Kryo with FPGrowth fails with an exception Key: SPARK-7483 URL: https://issues.apache.org/jira/browse/SPARK-7483 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Tomasz Bartczak Priority: Minor When using the FPGrowth algorithm with KryoSerializer, Spark fails with {code} Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: Can not set final scala.collection.mutable.ListBuffer field org.apache.spark.mllib.fpm.FPTree$Summary.nodes to scala.collection.mutable.ArrayBuffer Serialization trace: nodes (org.apache.spark.mllib.fpm.FPTree$Summary) org$apache$spark$mllib$fpm$FPTree$$summaries (org.apache.spark.mllib.fpm.FPTree) {code} This can be easily reproduced in the Spark codebase by setting {code} conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") {code} and running FPGrowthSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
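To make the reproduction steps above concrete, here is a hedged, self-contained Scala sketch: enable Kryo serialization as the reporter describes, then run FPGrowth. The transaction data and thresholds are made up; the point is only the serializer setting plus an FPGrowth run that shuffles FPTree objects.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.FPGrowth

object FPGrowthKryoRepro {
  def main(args: Array[String]): Unit = {
    // Enable Kryo, as in the report above.
    val conf = new SparkConf()
      .setAppName("FPGrowthKryoRepro")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // A tiny, illustrative transaction dataset.
    val transactions = sc.parallelize(Seq(
      Array("a", "b", "c"),
      Array("a", "b"),
      Array("b", "c")))

    // Multiple partitions force the FPTree partial results through the (Kryo) shuffle path.
    val model = new FPGrowth()
      .setMinSupport(0.5)
      .setNumPartitions(2)
      .run(transactions)

    println(model.freqItemsets.count())
    sc.stop()
  }
}
{code}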
[jira] [Updated] (SPARK-7484) Support passing jdbc connection properties for dataframe.createJDBCTable and insertIntoJDBC
[ https://issues.apache.org/jira/browse/SPARK-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7484: --- Issue Type: Improvement (was: Bug) Support passing jdbc connection properties for dataframe.createJDBCTable and insertIntoJDBC --- Key: SPARK-7484 URL: https://issues.apache.org/jira/browse/SPARK-7484 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.1 Reporter: Venkata Ramana G Priority: Minor A few JDBC drivers, like SybaseIQ, support passing the username and password only through connection properties, so the same needs to be supported. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
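To illustrate the gap, a hedged Scala sketch: the existing 1.3 methods only take a JDBC URL and table name, so any credentials have to live in the URL, which some drivers do not allow. The properties-taking overloads shown at the end are hypothetical and only indicate the proposed shape; they are not part of the current API.

{code}
import java.util.Properties

import org.apache.spark.sql.DataFrame

object JdbcSaveSketch {
  // Existing API (Spark 1.3): only a url and table name can be passed.
  def saveExisting(df: DataFrame, url: String): Unit = {
    df.createJDBCTable(url, "my_table", false) // allowExisting = false
    df.insertIntoJDBC(url, "my_table", false)  // overwrite = false
  }

  // Proposed shape (hypothetical overloads): pass driver-specific connection
  // properties such as user and password separately from the url.
  def saveProposed(df: DataFrame, url: String): Unit = {
    val props = new Properties()
    props.setProperty("user", "me")         // illustrative values
    props.setProperty("password", "secret")
    // df.createJDBCTable(url, "my_table", false, props)
    // df.insertIntoJDBC(url, "my_table", false, props)
  }
}
{code}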
[jira] [Updated] (SPARK-7435) Make DataFrame.show() consistent with that of Scala and pySpark
[ https://issues.apache.org/jira/browse/SPARK-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7435: --- Priority: Critical (was: Blocker) Make DataFrame.show() consistent with that of Scala and pySpark --- Key: SPARK-7435 URL: https://issues.apache.org/jira/browse/SPARK-7435 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Sun Rui Priority: Critical Currently in SparkR, DataFrame has two methods, show() and showDF(). show() prints the DataFrame column names and types, and showDF() prints the first numRows rows of a DataFrame. In Scala and pySpark, show() is used to print rows of a DataFrame. We should keep the API consistent unless there is some important reason not to. So I propose to interchange the names (show() and showDF()) in SparkR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7486) Add the streaming implementation for estimating quantiles and median
Liang-Chi Hsieh created SPARK-7486: -- Summary: Add the streaming implementation for estimating quantiles and median Key: SPARK-7486 URL: https://issues.apache.org/jira/browse/SPARK-7486 Project: Spark Issue Type: New Feature Components: ML, SQL Reporter: Liang-Chi Hsieh Streaming implementations that can estimate quantiles and the median are very useful for ML algorithms and data statistics. Apache DataFu Pig has this kind of implementation. We can port it to Spark. Please refer to: http://datafu.incubator.apache.org/docs/datafu/getting-started.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
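As a rough illustration of the kind of one-pass estimator this issue asks for (not the DataFu algorithm itself, which should be consulted directly), here is a minimal reservoir-sampling sketch in Scala: keep a fixed-size uniform sample of the stream and read quantiles off the sorted sample. The reservoir size and seed are arbitrary.

{code}
import scala.util.Random

// Single-pass approximate quantiles via a fixed-size uniform reservoir sample.
class ReservoirQuantile(reservoirSize: Int, seed: Long = 42L) {
  private val rng = new Random(seed)
  private val reservoir = new Array[Double](reservoirSize)
  private var seen = 0L

  def insert(x: Double): Unit = {
    if (seen < reservoirSize) {
      reservoir(seen.toInt) = x
    } else {
      // Classic reservoir sampling: replace a slot with probability reservoirSize / (seen + 1).
      val j = (rng.nextDouble() * (seen + 1)).toLong
      if (j < reservoirSize) reservoir(j.toInt) = x
    }
    seen += 1
  }

  def quantile(q: Double): Double = {
    val n = math.min(seen, reservoirSize.toLong).toInt
    require(n > 0, "no data seen yet")
    val sorted = reservoir.take(n).sorted
    sorted(math.min(n - 1, (q * n).toInt))
  }

  def median: Double = quantile(0.5)
}
{code}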
[jira] [Commented] (SPARK-7110) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication
[ https://issues.apache.org/jira/browse/SPARK-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534494#comment-14534494 ] Thomas Graves commented on SPARK-7110: -- [~gu chi] is some of the stack trace missing from the description? If so, could you attach the rest of it? Could you also provide the context in which NewHadoopRDD.getPartitions is called? Are you calling it directly, or is it being called from another Spark routine? (If so, which interface?) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication -- Key: SPARK-7110 URL: https://issues.apache.org/jira/browse/SPARK-7110 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: gu-chi Under yarn-client mode, this issue occurs randomly. The authentication method is set to kerberos, and saveAsNewAPIHadoopFile in PairRDDFunctions is used to save data to HDFS; then the exception comes as: org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7393) How to improve Spark SQL performance?
[ https://issues.apache.org/jira/browse/SPARK-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-7393. Resolution: Invalid Hi - thanks for giving feedback on your use of Spark SQL. This type of discussion should take place on the mailing list rather than on our feature issue tracker. How to improve Spark SQL performance? - Key: SPARK-7393 URL: https://issues.apache.org/jira/browse/SPARK-7393 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang Lee We want to use Spark SQL in our project, but we found that Spark SQL performance is not as good as we expected. The details are as follows: 1. We save data as parquet files on HDFS. 2. We just select one or several rows from the parquet file using Spark SQL. 3. When the total record number is 61 million, it needs about 3 seconds to get the result, which is unacceptably long for our scenario. 4. When the total record number is 2 million, it needs about 93 ms to get the result, which is still a little long for us. 5. The query statement is like: SELECT * FROM DBA WHERE COLA=? AND COLB=? And the table is not complex: it has fewer than 10 columns and the content of each column is less than 100 bytes. 6. Does anyone know how to improve the performance or give some other ideas? 7. Can Spark SQL support microsecond-level response? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
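Not a recommendation from the thread above, but one thing that often helps with repeated point lookups like the query in item 5 is caching the table in memory, so later queries avoid re-reading Parquet from HDFS. A hedged sketch against the Spark 1.3 API; the path and predicate values are illustrative.

{code}
import org.apache.spark.sql.SQLContext

object CachedLookupSketch {
  def queryWithCache(sqlContext: SQLContext): Unit = {
    // Load the parquet-backed table and cache it as an in-memory columnar table.
    val df = sqlContext.parquetFile("hdfs:///data/dba.parquet") // illustrative path
    df.registerTempTable("DBA")
    sqlContext.cacheTable("DBA")

    // The first query pays the load cost; subsequent ones hit the cached data.
    val rows = sqlContext.sql("SELECT * FROM DBA WHERE COLA = 'x' AND COLB = 'y'").collect()
    println(rows.length)
  }
}
{code}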
[jira] [Created] (SPARK-7485) Remove python artifacts from the assembly jar
Thomas Graves created SPARK-7485: Summary: Remove python artifacts from the assembly jar Key: SPARK-7485 URL: https://issues.apache.org/jira/browse/SPARK-7485 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Reporter: Thomas Graves We changed it so that we distribute the python files via a zip file in SPARK-6869. With that, we should remove the python files from the assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3928) Support wildcard matches on Parquet files
[ https://issues.apache.org/jira/browse/SPARK-3928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534582#comment-14534582 ] Thu Kyaw commented on SPARK-3928: - Hello [~lian cheng] please let me know if you want me to work on adding back the glob support. Support wildcard matches on Parquet files - Key: SPARK-3928 URL: https://issues.apache.org/jira/browse/SPARK-3928 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Reporter: Nicholas Chammas Assignee: Cheng Lian Priority: Minor Fix For: 1.3.0 {{SparkContext.textFile()}} supports patterns like {{part-*}} and {{2014-\?\?-\?\?}}. It would be nice if {{SparkContext.parquetFile()}} did the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
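For reference, the behavior under discussion, sketched in Scala: sc.textFile already expands glob patterns, and the request (per the description) is for parquetFile to accept the same patterns. The paths are illustrative, and the second call shows the desired behavior rather than a guarantee for any particular release.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object GlobSketch {
  def globExamples(sc: SparkContext, sqlContext: SQLContext): Unit = {
    // textFile resolves patterns like part-* and 2014-??-?? today.
    val text = sc.textFile("hdfs:///logs/2014-??-??/part-*")
    println(text.count())

    // The request is for parquetFile to accept the same patterns.
    val parquet = sqlContext.parquetFile("hdfs:///tables/events/part-*")
    println(parquet.count())
  }
}
{code}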
[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534629#comment-14534629 ] Rangarajan Sreenivasan commented on SPARK-5928: --- We are hitting a very similar issue. Job fails during the repartition stage. * Ours is a 10-node r3.8x cluster (119 GB 16-CPU per node) * Running Spark version 1.3.1 in Standalone cluster mode * Tried various parallelism values - 50, 100, 200, 500, 800 Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at 
io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at
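While blocks over 2 GB remain unsupported, the usual mitigation is to spread the shuffle over more keys and partitions so that no single remote block approaches the limit. Below is a hedged rework of the reproduction from the description, which deliberately funnels every record into one key; the key spreading and partition counts here are illustrative, and a SparkContext named sc (as in spark-shell) is assumed.

{code}
// Same payload as the original reproduction: incompressible 3 KB byte arrays.
val rdd = sc.parallelize(1 to 1e6.toInt, 100).map { ignore =>
  val n = 3e3.toInt
  val arr = new Array[Byte](n)
  scala.util.Random.nextBytes(arr) // keep the array from compressing away
  arr
}

// The original groups everything under the single key 1, forcing one giant
// shuffle block. Spreading records across many keys and using more reduce
// partitions keeps each remote block well under 2 GB.
rdd.map { x => (scala.util.Random.nextInt(500), x) }
  .groupByKey(500)
  .count()
{code}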
[jira] [Resolved] (SPARK-1920) Spark JAR compiled with Java 7 leads to PySpark not working in YARN
[ https://issues.apache.org/jira/browse/SPARK-1920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1920. -- Resolution: Duplicate Spark JAR compiled with Java 7 leads to PySpark not working in YARN --- Key: SPARK-1920 URL: https://issues.apache.org/jira/browse/SPARK-1920 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.0.0 Reporter: Tathagata Das Priority: Blocker The current (Spark 1.0) implementation of PySpark on Yarn requires python to be able to read the Spark assembly JAR. But a Spark assembly JAR compiled with Java 7 can sometimes not be readable by python. This can be due to the fact that JARs created by Java 7 with more than 2^16 files are encoded in Zip64, which python can't read. [SPARK-1911|https://issues.apache.org/jira/browse/SPARK-1911] warns users against using Java 7 when creating a Spark distribution. One way to fix this is to put pyspark in a different, smaller JAR than the rest of Spark so that it is readable by python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Assignee: Lianhui Wang Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Assignee: Lianhui Wang Priority: Blocker Fix For: 1.4.0 From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-6869. -- Resolution: Fixed Fix Version/s: 1.4.0 Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Blocker Fix For: 1.4.0 From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Target Version/s: 1.4.0 Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Blocker Fix For: 1.4.0 From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6961) Cannot save data to parquet files when executing from Windows from a Maven Project
[ https://issues.apache.org/jira/browse/SPARK-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6961: --- Priority: Critical (was: Blocker) Cannot save data to parquet files when executing from Windows from a Maven Project -- Key: SPARK-6961 URL: https://issues.apache.org/jira/browse/SPARK-6961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Bogdan Niculescu Priority: Critical I have setup a project where I am trying to save a DataFrame into a parquet file. My project is a Maven one with Spark 1.3.0 and Scala 2.11.5 : {code:xml} spark.version1.3.0/spark.version dependency groupIdorg.apache.spark/groupId artifactIdspark-core_2.11/artifactId version${spark.version}/version /dependency dependency groupIdorg.apache.spark/groupId artifactIdspark-sql_2.11/artifactId version${spark.version}/version /dependency {code} A simple version of my code that reproduces consistently the problem that I am seeing is : {code} import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkConf, SparkContext} case class Person(name: String, age: Int) object DataFrameTest extends App { val conf = new SparkConf().setMaster(local[4]).setAppName(DataFrameTest) val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) val persons = List(Person(a, 1), Person(b, 2)) val rdd = sc.parallelize(persons) val dataFrame = sqlContext.createDataFrame(rdd) dataFrame.saveAsParquetFile(test.parquet) } {code} All the time the exception that I am getting is : {code} Exception in thread main java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010) at org.apache.hadoop.util.Shell.runCommand(Shell.java:404) at org.apache.hadoop.util.Shell.run(Shell.java:379) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) at org.apache.hadoop.util.Shell.execCommand(Shell.java:678) at org.apache.hadoop.util.Shell.execCommand(Shell.java:661) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639) at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772) at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:409) at parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:401) at org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:443) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:240) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:256) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.immutable.List.map(List.scala:285) at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:370) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96) at org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308) at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123) at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922) at sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:17) at sparkTest.DataFrameTest$delayedInit$body.apply(DataFrameTest.scala:8) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at
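Not a confirmed fix for this ticket, but the NullPointerException from Shell.runCommand on Windows is commonly reported when Hadoop's winutils.exe is not available locally. A frequently cited workaround is to point hadoop.home.dir at a directory containing bin\winutils.exe before the SparkContext is created; the path below is illustrative. The sketch also restores the string quotes that the issue tracker stripped from the reporter's code.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object DataFrameTest extends App {
  // Commonly reported Windows workaround (assumption, not verified for this ticket):
  // hadoop.home.dir must point at a folder that contains bin\winutils.exe.
  System.setProperty("hadoop.home.dir", "C:\\hadoop")

  val conf = new SparkConf().setMaster("local[4]").setAppName("DataFrameTest")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val persons = List(Person("a", 1), Person("b", 2))
  val dataFrame = sqlContext.createDataFrame(sc.parallelize(persons))
  dataFrame.saveAsParquetFile("test.parquet")
}
{code}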
[jira] [Updated] (SPARK-6869) Add pyspark archives path to PYTHONPATH
[ https://issues.apache.org/jira/browse/SPARK-6869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-6869: - Issue Type: Bug (was: Improvement) Add pyspark archives path to PYTHONPATH --- Key: SPARK-6869 URL: https://issues.apache.org/jira/browse/SPARK-6869 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Weizhong Priority: Minor From SPARK-1920 and SPARK-1520 we know PySpark on Yarn cannot work when the assembly jar is packaged by JDK 1.7+, so ship the pyspark archives to executors via Yarn with --py-files. The pyspark archive name must contain spark-pyspark. 1st: zip pyspark to spark-pyspark_2.10.zip 2nd: ./bin/spark-submit --master yarn-client/yarn-cluster --py-files spark-pyspark_2.10.zip app.py args -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534622#comment-14534622 ] Steve Loughran commented on SPARK-7481: --- * hadoop-openstack: 100K, plus httpclient (400K) * hadoop-aws: 85K, plus jets3t (500K) * s3a needs the aws toolkit @ 11.5MB, so it's the big one * azure is 500K To retain s3n in spark, the hadoop-aws and jets3t dependencies need to go in; s3a is a fairly large addition. Add Hadoop 2.6+ profile to pull in object store FS accessors Key: SPARK-7481 URL: https://issues.apache.org/jira/browse/SPARK-7481 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Steve Loughran To keep the s3n classpath right, to add s3a, swift azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure) this adds more stuff to the client bundle, but will mean a single spark package can talk to all of the stores. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7381) Missing Python API for o.a.s.ml
[ https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-7381: --- Summary: Missing Python API for o.a.s.ml (was: Python API for Transformers) Missing Python API for o.a.s.ml --- Key: SPARK-7381 URL: https://issues.apache.org/jira/browse/SPARK-7381 Project: Spark Issue Type: Umbrella Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7447: --- Assignee: (was: Apache Spark) Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2 which was annoying). I have approximate 200 parquet files with key=date. When I load the dataframe with the sqlcontext that process is understandably slow because I assume it's reading all the meta data from the parquet files and doing the initial schema merging. So that's ok. However the problem I have is that once I have the dataframe. Doing any operation on the dataframe seems to have a 10-30 second lag before it actually starts processing the Job and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job actually is running the performance is fantastic, however this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows some thread maxed out on 1 cpu during the lagging time which makes me think it's not net i/o but something pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7447: --- Assignee: Apache Spark Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard Assignee: Apache Spark I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2 which was annoying). I have approximate 200 parquet files with key=date. When I load the dataframe with the sqlcontext that process is understandably slow because I assume it's reading all the meta data from the parquet files and doing the initial schema merging. So that's ok. However the problem I have is that once I have the dataframe. Doing any operation on the dataframe seems to have a 10-30 second lag before it actually starts processing the Job and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job actually is running the performance is fantastic, however this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows some thread maxed out on 1 cpu during the lagging time which makes me think it's not net i/o but something pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535018#comment-14535018 ] Apache Spark commented on SPARK-7447: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/6012 Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2, which was annoying). I have approximately 200 parquet files with key=date. When I load the dataframe with the sqlcontext, that process is understandably slow because I assume it's reading all the metadata from the parquet files and doing the initial schema merging. So that's ok. However, the problem is that once I have the dataframe, doing any operation on it seems to have a 10-30 second lag before it actually starts processing and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job is actually running the performance is fantastic, but this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows a thread maxed out on one CPU during the lag, which makes me think it's not network i/o but some pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
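For reference, the workflow described in the report can be sketched roughly as follows. The path and partition layout are illustrative only, and a spark-shell SparkContext named sc is assumed; sqlContext.parquetFile is the Spark 1.3-era entry point for reading a partitioned Parquet directory, which is where the metadata reads and schema merging happen.
{code}
// Minimal sketch of the reported workflow, assuming a date-partitioned layout
// such as /data/events/key=2015-01-01/part-*.parquet (hypothetical path).
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Reading the top-level directory scans the footers of all part files and
// merges their schemas; this step is expected to be slow.
val df = sqlContext.parquetFile("/data/events")

// The reported problem: even a simple action sits idle for 10-30 seconds
// before it appears as an Active Job in the UI.
df.count()
{code}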
[jira] [Commented] (SPARK-7477) TachyonBlockManager Store Block in TRY_CACHE mode which gives BlockNotFoundException when blocks are evicted from cache
[ https://issues.apache.org/jira/browse/SPARK-7477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534805#comment-14534805 ] Dibyendu Bhattacharya commented on SPARK-7477: -- I tried Hierarchical Storage on Tachyon ( http://tachyon-project.org/Hierarchy-Storage-on-Tachyon.html ), and that seems to have worked: I did not see any Spark job fail due to BlockNotFoundException. Below are my hierarchical storage settings: -Dtachyon.worker.hierarchystore.level.max=2 -Dtachyon.worker.hierarchystore.level0.alias=MEM -Dtachyon.worker.hierarchystore.level0.dirs.path=$TACHYON_RAM_FOLDER -Dtachyon.worker.hierarchystore.level0.dirs.quota=$TACHYON_WORKER_MEMORY_SIZE -Dtachyon.worker.hierarchystore.level1.alias=HDD -Dtachyon.worker.hierarchystore.level1.dirs.path=/mnt/tachyon -Dtachyon.worker.hierarchystore.level1.dirs.quota=50GB -Dtachyon.worker.allocate.strategy=MAX_FREE -Dtachyon.worker.evict.strategy=LRU TachyonBlockManager Store Block in TRY_CACHE mode which gives BlockNotFoundException when blocks are evicted from cache --- Key: SPARK-7477 URL: https://issues.apache.org/jira/browse/SPARK-7477 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.4.0 Reporter: Dibyendu Bhattacharya With Spark Streaming on Tachyon as the OFF_HEAP block store, I used the low-level receiver-based Kafka consumer (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) for Spark Streaming to pull from Kafka and write blocks to Tachyon. What I see is that TachyonBlockManager.scala puts the blocks with the WriteType.TRY_CACHE configuration. Because of this, blocks are evicted from the Tachyon cache, and when Spark tries to find such a block it throws BlockNotFoundException. When I modified the WriteType to CACHE_THROUGH, the BlockDropException went away, but it impacts the throughput. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
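For context, the code path being discussed is the one taken when blocks are persisted off-heap, which is what hands them to TachyonBlockManager and makes them subject to the TRY_CACHE eviction behavior. A minimal sketch of that setup follows; the Tachyon master address is hypothetical, and spark.tachyonStore.url is assumed to be the relevant setting in this Spark version.
{code}
// Sketch of routing persisted blocks through Tachyon via OFF_HEAP storage.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("tachyon-offheap-sketch")
  .set("spark.tachyonStore.url", "tachyon://master:19998") // hypothetical address

val sc = new SparkContext(conf)

// OFF_HEAP persistence is the path that goes through TachyonBlockManager;
// evicted blocks are what later surface as BlockNotFoundException.
val data = sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP)
data.count()
{code}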
[jira] [Resolved] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3454. Resolution: Fixed Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Imran Rashid Fix For: 1.4.0 Attachments: sparkmonitoringjsondesign.pdf If the WebUI supported extracting its data as JSON, it would be helpful for users who want to analyse stage / task / executor information. Fortunately, the WebUI already has a renderJson method, so we can implement that method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
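Once this feature is available, the UI data can be pulled over HTTP as JSON. A small sketch of querying it is shown below; the driver host and port are assumptions (4040 is the default driver UI port), and the /api/v1 endpoint layout is the one introduced with this work in 1.4.
{code}
// Fetch the JSON view of applications from the driver UI (host/port assumed).
import scala.io.Source

val json = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(json)
{code}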
[jira] [Created] (SPARK-7488) Python API for ml.recommendation
Burak Yavuz created SPARK-7488: -- Summary: Python API for ml.recommendation Key: SPARK-7488 URL: https://issues.apache.org/jira/browse/SPARK-7488 Project: Spark Issue Type: Sub-task Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
[ https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7489: --- Assignee: (was: Apache Spark) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
[ https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7489: --- Assignee: Apache Spark Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Assignee: Apache Spark Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
[ https://issues.apache.org/jira/browse/SPARK-7489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535053#comment-14535053 ] Apache Spark commented on SPARK-7489: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/6013 Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7489) Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true
Vinod KC created SPARK-7489: --- Summary: Spark shell crashes when compiled with scala 2.11 and SPARK_PREPEND_CLASSES=true Key: SPARK-7489 URL: https://issues.apache.org/jira/browse/SPARK-7489 Project: Spark Issue Type: Bug Components: Spark Shell Reporter: Vinod KC Steps followed export SPARK_PREPEND_CLASSES=true dev/change-version-to-2.11.sh sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean assembly bin/spark-shell 15/05/08 22:31:35 INFO Main: Created spark context.. Spark context available as sc. java.lang.NoClassDefFoundError: org/apache/hadoop/hive/conf/HiveConf at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671) at java.lang.Class.getConstructor0(Class.java:3075) at java.lang.Class.getConstructor(Class.java:1825) at org.apache.spark.repl.Main$.createSQLContext(Main.scala:86) ... 45 elided Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.conf.HiveConf at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ... 50 more console:11: error: not found: value sqlContext import sqlContext.implicits._ ^ console:11: error: not found: value sqlContext import sqlContext.sql There is a similar Resolved JIRA issue -SPARK-7470 and a PR https://github.com/apache/spark/pull/5997 , which handled same issue only in scala 2.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6091) Add MulticlassMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6091: --- Assignee: Apache Spark (was: Yanbo Liang) Add MulticlassMetrics in PySpark/MLlib -- Key: SPARK-6091 URL: https://issues.apache.org/jira/browse/SPARK-6091 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7487) Python API for ml.regression
Burak Yavuz created SPARK-7487: -- Summary: Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535316#comment-14535316 ] Josh Rosen commented on SPARK-7448: --- This is a change that would be nice to benchmark for performance. It might require a large job, such as a huge flatMap, before we see any significant improvement here. Implement custom byte array serializer for use in PySpark shuffle Key: SPARK-7448 URL: https://issues.apache.org/jira/browse/SPARK-7448 Project: Spark Issue Type: Improvement Components: PySpark, Shuffle Reporter: Josh Rosen PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We should implement a custom Serializer for use in these shuffles. This will allow us to take advantage of shuffle optimizations like SPARK-7311 for PySpark without requiring users to change the default serializer to KryoSerializer (this is useful for JobServer-type applications). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
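Until such a byte-array serializer exists, the description implies the current way to get those shuffle optimizations is to change the default serializer to Kryo, roughly as sketched here (application name is illustrative):
{code}
// Workaround noted in the description: switch the default serializer to Kryo
// so PySpark's byte-array shuffles avoid Java serialization overhead.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-shuffle-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)
{code}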
[jira] [Created] (SPARK-7490) MapOutputTracker: close input streams to free native memory
Evan Jones created SPARK-7490: - Summary: MapOutputTracker: close input streams to free native memory Key: SPARK-7490 URL: https://issues.apache.org/jira/browse/SPARK-7490 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Evan Jones Priority: Minor GZIPInputStream allocates native memory that is not freed until close() or when the finalizer runs. It is best to close() these streams explicitly to avoid native memory leaks. Pull request here: https://github.com/apache/spark/pull/5982 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
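The fix amounts to closing the stream deterministically instead of relying on finalization. A generic sketch of the pattern (not the actual MapOutputTracker code) is:
{code}
// Close GZIPInputStream explicitly so its native zlib buffers are released
// without waiting for the finalizer.
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

def decompress(compressed: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
  try {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](4096)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
    out.toByteArray
  } finally {
    in.close() // releases native memory immediately
  }
}
{code}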
[jira] [Commented] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535324#comment-14535324 ] Apache Spark commented on SPARK-7487: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/6016 Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7487: --- Assignee: (was: Apache Spark) Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7487) Python API for ml.regression
[ https://issues.apache.org/jira/browse/SPARK-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7487: --- Assignee: Apache Spark Python API for ml.regression Key: SPARK-7487 URL: https://issues.apache.org/jira/browse/SPARK-7487 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7491) Handle drivers for Metastore JDBC
Michael Armbrust created SPARK-7491: --- Summary: Handle drivers for Metastore JDBC Key: SPARK-7491 URL: https://issues.apache.org/jira/browse/SPARK-7491 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7410) Add option to avoid broadcasting configuration with newAPIHadoopFile
[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535386#comment-14535386 ] Josh Rosen commented on SPARK-7410: --- We should confirm this, but if I recall the reason that we have to broadcast these separately has something to do with configuration mutability or thread-safety. Based on a quick glance at SPARK-2585, it looks like I tried folding this into the RDD broadcast but this caused performance issues for RDDs with huge numbers of tasks. If you're interested in fixing this, I'd take a closer look through that old JIRA to try to figure out whether its discussion is still relevant. Add option to avoid broadcasting configuration with newAPIHadoopFile Key: SPARK-7410 URL: https://issues.apache.org/jira/browse/SPARK-7410 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.4.0 Reporter: Sandy Ryza I'm working with a Spark application that creates thousands of HadoopRDDs and unions them together. Certain details of the way the data is stored require this. Creating ten thousand of these RDDs takes about 10 minutes, even before any of them is used in an action. I dug into why this takes so long and it looks like the overhead of broadcasting the Hadoop configuration is taking up most of the time. In this case, the broadcasting isn't helpful because each HadoopRDD only corresponds to one or two tasks. When I reverted the original change that switched to broadcasting configurations, the time it took to instantiate these RDDs improved 10x. It would be nice if there was a way to turn this broadcasting off. Either through a Spark configuration option, a Hadoop configuration option, or an argument to hadoopFile / newAPIHadoopFile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
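The usage pattern that hits this overhead looks roughly like the sketch below: many small Hadoop RDDs created and unioned, each currently paying the cost of broadcasting its Hadoop Configuration. Paths and counts are hypothetical, and a spark-shell SparkContext named sc is assumed.
{code}
// Sketch of the reported pattern: thousands of small Hadoop RDDs unioned
// together, each broadcasting its own Configuration at creation time.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val paths: Seq[String] = (0 until 10000).map(i => s"/data/part-$i") // hypothetical layout

val rdds = paths.map { p =>
  sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](p).map(_._2.toString)
}
val all = sc.union(rdds) // creation alone is reported to take ~10 minutes
{code}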
[jira] [Resolved] (SPARK-7383) Python API for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7383. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5991 [https://github.com/apache/spark/pull/5991] Python API for ml.feature - Key: SPARK-7383 URL: https://issues.apache.org/jira/browse/SPARK-7383 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7436) Cannot implement nor use custom StandaloneRecoveryModeFactory implementations
[ https://issues.apache.org/jira/browse/SPARK-7436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7436: -- Assignee: Jacek Lewandowski Cannot implement nor use custom StandaloneRecoveryModeFactory implementations - Key: SPARK-7436 URL: https://issues.apache.org/jira/browse/SPARK-7436 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1 Reporter: Jacek Lewandowski Assignee: Jacek Lewandowski Fix For: 1.3.2, 1.4.0 At least, this code fragment is buggy ({{Master.scala}}): {code} case CUSTOM = val clazz = Class.forName(conf.get(spark.deploy.recoveryMode.factory)) val factory = clazz.getConstructor(conf.getClass, Serialization.getClass) .newInstance(conf, SerializationExtension(context.system)) .asInstanceOf[StandaloneRecoveryModeFactory] (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this)) {code} Because here: {{val factory = clazz.getConstructor(conf.getClass, Serialization.getClass)}} it tries to find the constructor which accepts {{org.apache.spark.SparkConf}} and class of companion object of {{akka.serialization.Serialization}} and then it tries to instantiate {{newInstance(conf, SerializationExtension(context.system))}} with instance of {{SparkConf}} and instance of {{Serialization}} class - not the companion objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
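Following the diagnosis in the description, one way to express the intended lookup is to ask for the constructor by the classes of the instances that are actually passed in, rather than the classes of the companion objects. This is a sketch of that idea, not necessarily the merged patch; conf and context refer to the Master's SparkConf and actor context as in the quoted fragment, and the import locations are assumptions.
{code}
// Look up the constructor by the instance classes (SparkConf, Serialization),
// not by conf.getClass / Serialization.getClass (the companion object's class).
import akka.serialization.{Serialization, SerializationExtension}
import org.apache.spark.SparkConf
import org.apache.spark.deploy.master.StandaloneRecoveryModeFactory

val clazz = Class.forName(conf.get("spark.deploy.recoveryMode.factory"))
val factory = clazz
  .getConstructor(classOf[SparkConf], classOf[Serialization])
  .newInstance(conf, SerializationExtension(context.system))
  .asInstanceOf[StandaloneRecoveryModeFactory]
{code}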
[jira] [Created] (SPARK-7493) ALTER TABLE statement
Sergey Semichev created SPARK-7493: -- Summary: ALTER TABLE statement Key: SPARK-7493 URL: https://issues.apache.org/jira/browse/SPARK-7493 Project: Spark Issue Type: Bug Components: SQL Environment: Databricks cloud Reporter: Sergey Semichev Priority: Minor A full table name (database_name.table_name) cannot be used with the ALTER TABLE statement, although it works with CREATE TABLE. For example, ALTER TABLE database_name.table_name ADD PARTITION (source_year='2014', source_month='01') fails with: Error in SQL statement: java.lang.RuntimeException: org.apache.spark.sql.AnalysisException: mismatched input 'ADD' expecting KW_EXCHANGE near 'test_table' in alter exchange partition; -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
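A rough reproducer for the reported behavior might look like the following. The database and table names are the reporter's examples, the column and partition definitions in CREATE TABLE are made up for illustration, and a HiveContext is assumed since ALTER TABLE ... ADD PARTITION goes through the Hive parser.
{code}
// Hypothetical reproducer: CREATE TABLE accepts the qualified name,
// while ALTER TABLE ... ADD PARTITION is reported to reject it.
val hc = new org.apache.spark.sql.hive.HiveContext(sc)

hc.sql("CREATE TABLE database_name.table_name (log STRING) " +
  "PARTITIONED BY (source_year STRING, source_month STRING)")

// Reported to fail with: mismatched input 'ADD' expecting KW_EXCHANGE
hc.sql("ALTER TABLE database_name.table_name " +
  "ADD PARTITION (source_year='2014', source_month='01')")
{code}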
[jira] [Resolved] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6824. -- Resolution: Fixed Fix Version/s: 1.4.0 1.5.0 Issue resolved by pull request 5969 [https://github.com/apache/spark/pull/5969] Fill the docs for DataFrame API in SparkR - Key: SPARK-6824 URL: https://issues.apache.org/jira/browse/SPARK-6824 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Assignee: Qian Huang Priority: Blocker Fix For: 1.5.0, 1.4.0 Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7298) Harmonize style of new UI visualizations
[ https://issues.apache.org/jira/browse/SPARK-7298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-7298. -- Resolution: Fixed Fix Version/s: 1.4.0 Harmonize style of new UI visualizations Key: SPARK-7298 URL: https://issues.apache.org/jira/browse/SPARK-7298 Project: Spark Issue Type: Sub-task Components: Web UI Reporter: Patrick Wendell Assignee: Matei Zaharia Priority: Blocker Fix For: 1.4.0 We need to go through all new visualizations in the web UI and make sure they have a consistent style, both with each other and with the rest of the UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-7133. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5744 [https://github.com/apache/spark/pull/5744] Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python -- Key: SPARK-7133 URL: https://issues.apache.org/jira/browse/SPARK-7133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Wenchen Fan Priority: Blocker Fix For: 1.4.0 Typing {code} df.col[1] {code} and {code} df.col['field'] {code} is so much easier than {code} df.col.getField('field') df.col.getItem(1) {code} This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
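With this change merged, the shorthand apply form and the explicit getField / getItem calls should be interchangeable on the Scala side. A small sketch is below; df is an assumed DataFrame with a struct column "event" and an array column "tags" (both hypothetical names).
{code}
// Equivalent accessors after this change: Column.apply vs getField / getItem.
import org.apache.spark.sql.functions.col

val byApply   = df.select(col("event")("user"), col("tags")(0))
val byGetters = df.select(col("event").getField("user"), col("tags").getItem(0))
{code}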
[jira] [Updated] (SPARK-7448) Implement custom byte array serializer for use in PySpark shuffle
[ https://issues.apache.org/jira/browse/SPARK-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-7448: -- Priority: Minor (was: Major) Implement custom byte array serializer for use in PySpark shuffle Key: SPARK-7448 URL: https://issues.apache.org/jira/browse/SPARK-7448 Project: Spark Issue Type: Improvement Components: PySpark, Shuffle Reporter: Josh Rosen Priority: Minor PySpark's shuffle typically shuffles Java RDDs that contain byte arrays. We should implement a custom Serializer for use in these shuffles. This will allow us to take advantage of shuffle optimizations like SPARK-7311 for PySpark without requiring users to change the default serializer to KryoSerializer (this is useful for JobServer-type applications). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB
[ https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14534629#comment-14534629 ] Rangarajan Sreenivasan edited comment on SPARK-5928 at 5/8/15 5:51 PM: --- We are hitting a very similar issue. Job fails during the repartition stage. * Ours is a 10-node r3.4x cluster (119 GB 16-CPU per node) * Running Spark version 1.3.1 in Standalone cluster mode * Tried various parallelism values - 50, 100, 200, 500, 800 was (Author: sranga): We are hitting a very similar issue. Job fails during the repartition stage. * Ours is a 10-node r3.8x cluster (119 GB 16-CPU per node) * Running Spark version 1.3.1 in Standalone cluster mode * Tried various parallelism values - 50, 100, 200, 500, 800 Remote Shuffle Blocks cannot be more than 2 GB -- Key: SPARK-5928 URL: https://issues.apache.org/jira/browse/SPARK-5928 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Imran Rashid If a shuffle block is over 2GB, the shuffle fails, with an uninformative exception. The tasks get retried a few times and then eventually the job fails. Here is an example program which can cause the exception: {code} val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore = val n = 3e3.toInt val arr = new Array[Byte](n) //need to make sure the array doesn't compress to something small scala.util.Random.nextBytes(arr) arr } rdd.map { x = (1, x)}.groupByKey().count() {code} Note that you can't trigger this exception in local mode, it only happens on remote fetches. I triggered these exceptions running with {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}} {noformat} 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message= org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 2147483647: 3021252889 - discarded at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame length exceeds 2147483647: 3021252889 
- discarded at io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403) at io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) at
[jira] [Resolved] (SPARK-7474) ParamGridBuilder's doctest doesn't show up correctly in the generated doc
[ https://issues.apache.org/jira/browse/SPARK-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7474. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 6001 [https://github.com/apache/spark/pull/6001] ParamGridBuilder's doctest doesn't show up correctly in the generated doc - Key: SPARK-7474 URL: https://issues.apache.org/jira/browse/SPARK-7474 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 {code} from classification import LogisticRegression lr = LogisticRegression() output = ParamGridBuilder().baseOn({lr.labelCol: 'l'}) .baseOn([lr.predictionCol, 'p']) .addGrid(lr.regParam, [1.0, 2.0, 3.0]) .addGrid(lr.maxIter, [1, 5]) .addGrid(lr.featuresCol, ['f']) .build() expected = [ {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 1, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 1.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 2.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}, {lr.regParam: 3.0, lr.featuresCol: 'f', lr.maxIter: 5, lr.labelCol: 'l', lr.predictionCol: 'p'}] len(output) == len(expected) True all([m in expected for m in output]) True {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7436) Cannot implement nor use custom StandaloneRecoveryModeFactory implementations
[ https://issues.apache.org/jira/browse/SPARK-7436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-7436. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5975 [https://github.com/apache/spark/pull/5975] Cannot implement nor use custom StandaloneRecoveryModeFactory implementations - Key: SPARK-7436 URL: https://issues.apache.org/jira/browse/SPARK-7436 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.3.1 Reporter: Jacek Lewandowski Fix For: 1.3.2, 1.4.0 At least, this code fragment is buggy ({{Master.scala}}): {code} case CUSTOM = val clazz = Class.forName(conf.get(spark.deploy.recoveryMode.factory)) val factory = clazz.getConstructor(conf.getClass, Serialization.getClass) .newInstance(conf, SerializationExtension(context.system)) .asInstanceOf[StandaloneRecoveryModeFactory] (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this)) {code} Because here: {{val factory = clazz.getConstructor(conf.getClass, Serialization.getClass)}} it tries to find the constructor which accepts {{org.apache.spark.SparkConf}} and class of companion object of {{akka.serialization.Serialization}} and then it tries to instantiate {{newInstance(conf, SerializationExtension(context.system))}} with instance of {{SparkConf}} and instance of {{Serialization}} class - not the companion objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7447) Large Job submission lag when using Parquet w/ Schema Merging
[ https://issues.apache.org/jira/browse/SPARK-7447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14535208#comment-14535208 ] Brad Willard commented on SPARK-7447: - Thanks, you are a hero. Large Job submission lag when using Parquet w/ Schema Merging - Key: SPARK-7447 URL: https://issues.apache.org/jira/browse/SPARK-7447 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 1.3.0, 1.3.1 Environment: Spark 1.3.1, aws, persistent hdfs version 2 with ebs storage, pyspark, 8 x c3.8xlarge nodes. spark-conf spark.executor.memory 50g spark.driver.cores 32 spark.driver.memory 50g spark.default.parallelism 512 spark.sql.shuffle.partitions 512 spark.task.maxFailures 30 spark.executor.logs.rolling.maxRetainedFiles 2 spark.executor.logs.rolling.size.maxBytes 102400 spark.executor.logs.rolling.strategy size spark.shuffle.spill false spark.sql.parquet.cacheMetadata true spark.sql.parquet.filterPushdown true spark.sql.codegen true spark.akka.threads = 64 Reporter: Brad Willard I have 2.6 billion rows in parquet format and I'm trying to use the new schema merging feature (I was enforcing a consistent schema manually before in 0.8-1.2, which was annoying). I have approximately 200 parquet files with key=date. When I load the dataframe with the sqlcontext, that process is understandably slow because I assume it's reading all the metadata from the parquet files and doing the initial schema merging. So that's ok. However, the problem is that once I have the dataframe, doing any operation on it seems to have a 10-30 second lag before it actually starts processing and shows up as an Active Job in the Spark Manager. This was an instant operation in all previous versions of Spark. Once the job is actually running the performance is fantastic, but this job submission lag is horrible. I'm wondering if there is a bug with recomputing the schema merging. Running top on the master node shows a thread maxed out on one CPU during the lag, which makes me think it's not network i/o but some pre-processing before job submission. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org