[jira] [Created] (SPARK-6735) Provide options to make maximum executor failure count (which kills the application) relative to a window duration or disable it.
Twinkle Sachdeva created SPARK-6735: --- Summary: Provide options to make maximum executor failure count (which kills the application) relative to a window duration or disable it. Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.3.0, 1.2.1, 1.2.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which specifies the maximum number of executor failures, after which the application fails. For long-running applications, a user may want the application never to be killed at all, or may want this threshold evaluated relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
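The windowed variant of the failure check requested above can be sketched in a few lines. This is a hypothetical, standalone illustration, not Spark code: `WindowedFailureTracker`, its parameters, and the convention that `None` disables a limit are all assumptions about how such an option might behave.

```python
import time
from collections import deque

class WindowedFailureTracker:
    """Counts executor failures inside a sliding time window.

    Hypothetical sketch: the application is only killed when more than
    `max_failures` failures occur within `window_seconds`. A window of
    None means failures are counted over the whole application lifetime,
    and max_failures of None disables the kill check entirely.
    """
    def __init__(self, max_failures, window_seconds=None, clock=time.time):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.clock = clock
        self.failures = deque()  # timestamps of recorded failures

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        if self.window_seconds is not None:
            # Drop failures that fell out of the window.
            cutoff = now - self.window_seconds
            while self.failures and self.failures[0] < cutoff:
                self.failures.popleft()

    def should_kill_application(self):
        if self.max_failures is None:  # check disabled
            return False
        return len(self.failures) > self.max_failures
```

With a 10-second window, three quick failures exceed a limit of 2, but once the old failures age out of the window the application is no longer considered doomed.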
[jira] [Commented] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482752#comment-14482752 ] Alberto commented on SPARK-6431: I think you're right, Cody. I've been having a look at my code and I've found out that I'm not creating the topic before creating the DirectStream. If you are interested, here is the test I am running: https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala I completely agree with you: if the topic doesn't exist, it should return an error and not a misleading empty set. Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the Kafka topic does not have any messages yet, I get the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine.
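The behavior the commenters ask for (fail fast when the topic is missing, instead of returning the empty offset set that later surfaces as "Couldn't find leader offsets for Set()") can be sketched as follows. This is a hypothetical, standalone illustration: `leader_offsets_for` and its `topic_metadata` dict stand in for a real Kafka metadata request and are not part of the Spark or Kafka APIs.

```python
def leader_offsets_for(topic, topic_metadata):
    """Return a starting offset per (topic, partition), or fail loudly.

    `topic_metadata` stands in for what a Kafka metadata request would
    return: a dict mapping topic name -> list of partition ids. If the
    topic is unknown, raise instead of silently yielding an empty set.
    """
    if topic not in topic_metadata:
        raise ValueError(
            "Topic %r does not exist; create it before building the "
            "direct stream" % topic)
    # Start every known partition at offset 0 (an empty topic is fine;
    # only a *missing* topic is an error).
    return {(topic, partition): 0 for partition in topic_metadata[topic]}
```

The key design point matches the comment: an empty but existing topic yields a valid (possibly empty-offset) result, while a nonexistent topic is reported as an error rather than deferred into a confusing downstream exception.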
[jira] [Comment Edited] (SPARK-6431) Couldn't find leader offsets exception when creating KafkaDirectStream
[ https://issues.apache.org/jira/browse/SPARK-6431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482752#comment-14482752 ] Alberto edited comment on SPARK-6431 at 4/7/15 7:43 AM: You're absolutely right, Cody. I've been having a look at my code and I've found out that I'm not creating the topic before creating the DirectStream. If you are interested, here is the test I am running: https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala I completely agree with you: if the topic doesn't exist, it should return an error and not a misleading empty set. was (Author: ardlema): I think you're right, Cody. I've been having a look at my code and I've found out that I'm not creating the topic before creating the DirectStream. If you are interested, here is the test I am running: https://github.com/ardlema/big-brother/blob/master/src/test/scala/org/ardlema/spark/DwellDetectorTest.scala I completely agree with you: if the topic doesn't exist, it should return an error and not a misleading empty set. Couldn't find leader offsets exception when creating KafkaDirectStream -- Key: SPARK-6431 URL: https://issues.apache.org/jira/browse/SPARK-6431 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: Alberto When I try to create an InputDStream using the createDirectStream method of the KafkaUtils class and the Kafka topic does not have any messages yet, I get the following error: org.apache.spark.SparkException: Couldn't find leader offsets for Set() org.apache.spark.SparkException: org.apache.spark.SparkException: Couldn't find leader offsets for Set() at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createDirectStream$2.apply(KafkaUtils.scala:413) If I put a message in the topic before creating the DirectStream, everything works fine.
[jira] [Updated] (SPARK-6721) IllegalStateException when connecting to MongoDB using spark-submit
[ https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luis Rodríguez Trejo updated SPARK-6721: Summary: IllegalStateException when connecting to MongoDB using spark-submit (was: IllegalStateException) IllegalStateException when connecting to MongoDB using spark-submit --- Key: SPARK-6721 URL: https://issues.apache.org/jira/browse/SPARK-6721 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.2.1, 1.3.0 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3 Reporter: Luis Rodríguez Trejo Labels: MongoDB, java.lang.IllegalStateexception, saveAsNewAPIHadoopFile I get the following exception when using saveAsNewAPIHadoopFile: {code} 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:745) {code} Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. There is no close instruction anywhere in the code.
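The stack trace above fails inside com.mongodb's `Assertions.isTrue`, i.e. the driver asserts the connection is still open before every insert. The failure mode can be illustrated with a minimal, hypothetical sketch (none of these names are real MongoDB or Spark APIs): a writer whose client was closed elsewhere raises on write instead of saving the record, which matches the "data just remains unprocessed" symptom.

```python
class ClosedClientError(RuntimeError):
    """Stand-in for java.lang.IllegalStateException: open."""

class MongoLikeWriter:
    """Hypothetical writer mirroring the driver's open-connection assert.

    If some other component closes the shared client, every subsequent
    write from a task fails the 'open' check instead of persisting data.
    """
    def __init__(self):
        self._open = True
        self.saved = []

    def close(self):
        self._open = False

    def write(self, doc):
        # Mirrors org.bson.util.Assertions.isTrue("open", ...).
        if not self._open:
            raise ClosedClientError("open")
        self.saved.append(doc)
```

In this toy model, documents written before `close()` survive and documents written after it are lost with an exception, which is the shape of the bug report: something closes the connector before all tasks have finished writing.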
[jira] [Created] (SPARK-6736) Example of Graph#aggregateMessages has error
Sasaki Toru created SPARK-6736: -- Summary: Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages contains an error. Since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages.
[jira] [Commented] (SPARK-6708) Using Hive UDTF may throw ClassNotFoundException
[ https://issues.apache.org/jira/browse/SPARK-6708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482805#comment-14482805 ] Cheng Lian commented on SPARK-6708: --- Thanks for pointing this out. I followed SPARK-4854 and found that this issue duplicates SPARK-4811. Using Hive UDTF may throw ClassNotFoundException Key: SPARK-6708 URL: https://issues.apache.org/jira/browse/SPARK-6708 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Spark shell session for reproducing this issue: {code} import sqlContext._ sql("create table t1 (str string)") sql("select v.va from t1 lateral view json_tuple(str, 'a') v as va").queryExecution.analyzed {code} Exception thrown: {noformat} java.lang.ClassNotFoundException: json_tuple at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.spark.sql.hive.HiveFunctionWrapper.createFunction(Shim13.scala:148) at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:274) at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:274) at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector$lzycompute(hiveUdfs.scala:280) at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspector(hiveUdfs.scala:280) at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:285) at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:285) at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:291) at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$2.apply(basicOperators.scala:60) at
scala.Option.map(Option.scala:145) at org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:70) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:117) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:117) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2$$anonfun$11.apply(Analyzer.scala:292) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2$$anonfun$11.apply(Analyzer.scala:292) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:292) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$7$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:284) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:252) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:252) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:251) at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-4811: -- Affects Version/s: 1.2.1 1.3.0 Custom UDTFs not working in Spark SQL - Key: SPARK-4811 URL: https://issues.apache.org/jira/browse/SPARK-4811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.1, 1.3.0 Reporter: Saurabh Santhosh Priority: Critical I am using the Thrift server interface to Spark SQL and using beeline to connect to it. I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the following exception when using any custom UDTF. These are the steps I did: *Created a UDTF 'com.x.y.xxx'.* Registered the UDTF using the following query: *create temporary function xxx as 'com.x.y.xxx'* The registration went through without any errors, but when I tried executing the UDTF I got the following error: *java.lang.ClassNotFoundException: xxx* The funny thing is that it is trying to load the function name instead of the function class. The exception is at *line no: 81 in hiveudfs.scala*. I have been at it for quite a long time.
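The symptom reported here (the exception names the function alias `xxx` rather than the registered class `com.x.y.xxx`) suggests the resolver passes the wrong string to the class loader. The lookup bug and its fix can be illustrated with a toy, hypothetical registry; none of these names exist in Spark, and the stand-in "classpath" is just a set:

```python
class UdtfRegistry:
    """Toy model of CREATE TEMPORARY FUNCTION xxx AS 'com.x.y.xxx'."""

    def __init__(self):
        self._functions = {}  # alias -> fully qualified class name

    def register(self, alias, class_name):
        self._functions[alias] = class_name

    def resolve_buggy(self, alias):
        # Bug: tries to load the alias itself as if it were a class,
        # producing ClassNotFoundException: xxx.
        return self._load_class(alias)

    def resolve_fixed(self, alias):
        # Fix: look the alias up first, then load the mapped class.
        return self._load_class(self._functions[alias])

    def _load_class(self, name):
        known = {"com.x.y.xxx"}  # stand-in for the JVM classpath
        if name not in known:
            raise KeyError("ClassNotFoundException: " + name)
        return name
```

The point of the sketch is only the lookup order: registration stores alias-to-class, so resolution must dereference the alias before touching the class loader.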
[jira] [Commented] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482813#comment-14482813 ] Apache Spark commented on SPARK-6736: - User 'sasakitoa' has created a pull request for this issue: https://github.com/apache/spark/pull/5388 [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages contains an error. Since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages.
[jira] [Updated] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory
[ https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Li updated SPARK-6737: -- Description: I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 days. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions. was: I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 weeks. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions.
OutputCommitCoordinator.authorizedCommittersByStage map out of memory - Key: SPARK-6737 URL: https://issues.apache.org/jira/browse/SPARK-6737 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Affects Versions: 1.3.0 Environment: spark 1.3.1 Reporter: Tao Li Priority: Critical Labels: Bug, Core, DAGScheduler, OOM, Streaming I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 days. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions.
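The leak pattern described in this report can be shown with a minimal, hypothetical sketch. The names mirror the report, but the behavior is illustrative only, not Spark's actual implementation: the map grows one entry per stage, and only an explicit stage-end callback shrinks it.

```python
class OutputCommitCoordinator:
    """Toy model of the reported leak."""

    def __init__(self):
        # stage id -> {partition id: authorized task attempt id}
        self.authorized_committers_by_stage = {}

    def authorize(self, stage_id, partition_id, attempt_id):
        committers = self.authorized_committers_by_stage.setdefault(stage_id, {})
        committers[partition_id] = attempt_id

    def stage_end(self, stage_id):
        # This cleanup must run when the scheduler finishes a stage.
        # If it is never called (the bug), the map grows without bound,
        # and a long-running streaming job eventually OOMs.
        self.authorized_committers_by_stage.pop(stage_id, None)
```

A streaming job creates stages continuously, so without the `stage_end` call the entry count tracks the total number of stages ever run rather than the number currently active.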
[jira] [Commented] (SPARK-6514) For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself
[ https://issues.apache.org/jira/browse/SPARK-6514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482749#comment-14482749 ] Paweł Kopiczko commented on SPARK-6514: --- I named the param `regionName` after the one in KinesisClientLibConfiguration. I thought it would be obvious to anyone familiar with the KCL consumer API, but dynamoRegion is also good. We can extract the region from streamURL, but someone might want to set a different region than the stream's location. Why do you think that exposing the implementation is not good? Don't you think that everything configurable in the KCL client we use should also be configurable when creating the DStream? For Kinesis Streaming, use the same region for DynamoDB (KCL checkpoints) as the Kinesis stream itself Key: SPARK-6514 URL: https://issues.apache.org/jira/browse/SPARK-6514 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Chris Fregly Context: I started the original Kinesis impl with KCL 1.0 (not supported), then finished on KCL 1.1 (supported) without realizing that it's supported. Also, we should upgrade to the latest Kinesis Client Library (KCL), which I believe is currently v1.2.
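The defaulting behavior under discussion (the DynamoDB checkpoint region follows the Kinesis stream's region unless explicitly overridden) can be sketched as a plain config object. `KinesisCheckpointConfig` and its field names are hypothetical, not part of Spark's Kinesis API; they only model the `regionName`-style knob mentioned in the comment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KinesisCheckpointConfig:
    """Hypothetical config: DynamoDB (KCL checkpoint) region defaults
    to the Kinesis stream's own region but remains overridable."""
    stream_region: str
    dynamo_region: Optional[str] = None

    def effective_dynamo_region(self) -> str:
        # Use the explicit override when given; otherwise co-locate
        # checkpoints with the stream.
        return self.dynamo_region or self.stream_region
```

This captures both positions in the thread: the common case needs no extra parameter (checkpoints land in the stream's region), while users who want a different checkpoint region still have a knob.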
[jira] [Commented] (SPARK-6721) IllegalStateException when connecting to MongoDB using spark-submit
[ https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482798#comment-14482798 ] Luis Rodríguez Trejo commented on SPARK-6721: - [~sowen] Thank you for your comments. I modified the name; I think it is more meaningful now. Regarding your first comment, I actually had no idea which component caused the problem. I apologize. Should I close the issue? IllegalStateException when connecting to MongoDB using spark-submit --- Key: SPARK-6721 URL: https://issues.apache.org/jira/browse/SPARK-6721 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.2.0, 1.2.1, 1.3.0 Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3 Reporter: Luis Rodríguez Trejo Labels: MongoDB, java.lang.IllegalStateexception, saveAsNewAPIHadoopFile I get the following exception when using saveAsNewAPIHadoopFile: {code} 15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open at org.bson.util.Assertions.isTrue(Assertions.java:36) at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184) at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167) at com.mongodb.DBCollection.insert(DBCollection.java:161) at com.mongodb.DBCollection.insert(DBCollection.java:107) at com.mongodb.DBCollection.save(DBCollection.java:1049) at com.mongodb.DBCollection.save(DBCollection.java:1014) at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. There is no close instruction anywhere in the code.
[jira] [Resolved] (SPARK-4811) Custom UDTFs not working in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-4811. --- Resolution: Duplicate Although this ticket was opened earlier, I am marking it as a duplicate of SPARK-6708 because SPARK-6708 gives clear reproduction steps. Custom UDTFs not working in Spark SQL - Key: SPARK-4811 URL: https://issues.apache.org/jira/browse/SPARK-4811 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1, 1.2.1, 1.3.0 Reporter: Saurabh Santhosh Priority: Critical I am using the Thrift server interface to Spark SQL and using beeline to connect to it. I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the following exception when using any custom UDTF. These are the steps I did: *Created a UDTF 'com.x.y.xxx'.* Registered the UDTF using the following query: *create temporary function xxx as 'com.x.y.xxx'* The registration went through without any errors, but when I tried executing the UDTF I got the following error: *java.lang.ClassNotFoundException: xxx* The funny thing is that it is trying to load the function name instead of the function class. The exception is at *line no: 81 in hiveudfs.scala*. I have been at it for quite a long time.
[jira] [Updated] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory
[ https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Li updated SPARK-6737: -- Description: I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 weeks. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions. was: I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 weeks. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions.
OutputCommitCoordinator.authorizedCommittersByStage map out of memory - Key: SPARK-6737 URL: https://issues.apache.org/jira/browse/SPARK-6737 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Affects Versions: 1.3.0 Environment: spark 1.3.1 Reporter: Tao Li Priority: Critical Labels: Bug, Core, DAGScheduler, OOM, Streaming I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 weeks. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions.
[jira] [Created] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory
Tao Li created SPARK-6737: - Summary: OutputCommitCoordinator.authorizedCommittersByStage map out of memory Key: SPARK-6737 URL: https://issues.apache.org/jira/browse/SPARK-6737 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Affects Versions: 1.3.0 Environment: spark 1.3.1 Reporter: Tao Li Priority: Critical I am using Spark Streaming (1.3.1) as a long-running service, and it ran out of memory after running for 7 weeks. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which removes a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the stage info in authorizedCommittersByStage is never cleaned up, which causes the OOM. This happens in my Spark Streaming program (1.3.1); I am not sure whether it also appears in other Spark components or other Spark versions.
[jira] [Commented] (SPARK-6695) Add an external iterator: a hadoop-like output collector
[ https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482768#comment-14482768 ] uncleGen commented on SPARK-6695: - [~srowen] Thanks for your patience. Yeah, it is a good fix with the smallest number of changes. In my practical use, we usually need to create a big array, and sometimes its size may exceed 2^31 - 1 elements. So IMHO we may provide a general external `Iterator` for the sake of usability and memory usage. Still, [PR-5364|https://github.com/apache/spark/pull/5364] is enough for that issue. Add an external iterator: a hadoop-like output collector Key: SPARK-6695 URL: https://issues.apache.org/jira/browse/SPARK-6695 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: uncleGen In practical use, we usually need to create a big iterator, meaning one that is too big in `memory usage` or too long in `array size`. On the one hand, it leads to too much memory consumption. On the other hand, one `Array` may not hold all the elements, as Java array indices are of type 'int' (4 bytes, 32 bits). So, IMHO, we may provide a `collector` which has a buffer (100MB or some other size) and can spill data to disk. The use case may look like: {code:borderStyle=solid} rdd.mapPartitions { it => ... val collector = new ExternalCollector() collector.collect(a) ... collector.iterator } {code} I have done some related work and I need your opinions, thanks!
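The proposed collector can be sketched in a self-contained form: buffer up to N elements in memory, spill full buffers to a temporary file, and replay everything through `iterator()`. The class name and `collect`/`iterator` API follow the snippet above; the element-count threshold and pickle-based spill format are assumptions made for the sake of a runnable illustration.

```python
import pickle
import tempfile

class ExternalCollector:
    """Illustrative collector that spills to disk instead of growing
    one huge in-memory array."""

    def __init__(self, buffer_size=100_000):
        self.buffer_size = buffer_size
        self.buffer = []
        self.spill_file = None

    def collect(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.buffer_size:
            self._spill()

    def _spill(self):
        # Append the full buffer to the spill file as one pickled batch.
        if self.spill_file is None:
            self.spill_file = tempfile.TemporaryFile()
        pickle.dump(self.buffer, self.spill_file)
        self.buffer = []

    def iterator(self):
        # Replay spilled batches first, then whatever is still buffered.
        if self.spill_file is not None:
            self.spill_file.seek(0)
            while True:
                try:
                    batch = pickle.load(self.spill_file)
                except EOFError:
                    break
                for item in batch:
                    yield item
        for item in self.buffer:
            yield item
```

A real implementation would track bytes rather than element counts and handle cleanup and ordering guarantees, but the shape matches the ticket: bounded memory on the write path, a single iterator on the read path.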
[jira] [Commented] (SPARK-4854) Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI
[ https://issues.apache.org/jira/browse/SPARK-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482802#comment-14482802 ] Cheng Lian commented on SPARK-4854: --- [~wanshenghua] Is the XXX in the ClassNotFoundException a class name or a UDTF function name? I'm trying to figure out whether this is related to SPARK-6708. Thanks! Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI - Key: SPARK-4854 URL: https://issues.apache.org/jira/browse/SPARK-4854 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.1.1 Reporter: Shenghua Wan Hello, I met a problem when using the Spark SQL CLI: a custom UDTF with lateral view throws a ClassNotFound exception. I did a couple of experiments in the same environment (Spark versions 1.1.0, 1.1.1): select + same custom UDTF (passed); select + lateral view + custom UDTF (ClassNotFoundException); select + lateral view + built-in UDTF (passed). I have done some googling these days and found one related Spark issue ticket, https://issues.apache.org/jira/browse/SPARK-4811, which is about custom UDTFs not working in Spark SQL. It would be helpful to put actual code here to reproduce the problem; however, corporate regulations might prohibit this, so sorry about that. Directly using explode's source code in a jar will help reproduce it anyway.
Here is a portion of the stack trace from the exception, just in case: java.lang.ClassNotFoundException: XXX at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.apache.spark.sql.hive.HiveFunctionFactory$class.createFunction(hiveUdfs.scala:81) at org.apache.spark.sql.hive.HiveGenericUdtf.createFunction(hiveUdfs.scala:247) at org.apache.spark.sql.hive.HiveGenericUdtf.function$lzycompute(hiveUdfs.scala:254) at org.apache.spark.sql.hive.HiveGenericUdtf.function(hiveUdfs.scala:254) at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors$lzycompute(hiveUdfs.scala:261) at org.apache.spark.sql.hive.HiveGenericUdtf.outputInspectors(hiveUdfs.scala:260) at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes$lzycompute(hiveUdfs.scala:265) at org.apache.spark.sql.hive.HiveGenericUdtf.outputDataTypes(hiveUdfs.scala:265) at org.apache.spark.sql.hive.HiveGenericUdtf.makeOutput(hiveUdfs.scala:269) at org.apache.spark.sql.catalyst.expressions.Generator.output(generators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50) at org.apache.spark.sql.catalyst.plans.logical.Generate$$anonfun$1.apply(basicOperators.scala:50) at scala.Option.map(Option.scala:145) at org.apache.spark.sql.catalyst.plans.logical.Generate.generatorOutput(basicOperators.scala:50) at org.apache.spark.sql.catalyst.plans.logical.Generate.output(basicOperators.scala:60) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveChildren$1.apply(LogicalPlan.scala:79) at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) the rest is omitted. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6736: --- Assignee: (was: Apache Spark) [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages has an error: since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
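The corrected call described in this ticket can be sketched as follows. This is a hedged example in the spirit of the programming guide's "older followers" sample, not the ticket's exact text: the function name olderFollowerStats and the assumption that rawGraph is a Graph[Double, Int] whose vertex attribute is an age are illustrative. The key point is that aggregateMessages is invoked on the graph instance (rawGraph), not on the Graph companion object.

```scala
import org.apache.spark.graphx._

// For each vertex, count its older followers and sum their ages.
// aggregateMessages is an instance method, so it is called on rawGraph.
def olderFollowerStats(rawGraph: Graph[Double, Int]): VertexRDD[(Int, Double)] =
  rawGraph.aggregateMessages[(Int, Double)](
    triplet => {
      // Send (count = 1, follower's age) to the followee when the follower is older.
      if (triplet.srcAttr > triplet.dstAttr) {
        triplet.sendToDst((1, triplet.srcAttr))
      }
    },
    // Merge partial (count, ageSum) pairs per vertex.
    (a, b) => (a._1 + b._1, a._2 + b._2)
  )
```

Calling `Graph.aggregateMessages(...)` instead fails to compile, which is exactly the documentation bug this issue fixes.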
[jira] [Assigned] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6736: --- Assignee: Apache Spark [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Assignee: Apache Spark Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages has an error: since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasaki Toru updated SPARK-6736: --- Summary: [GraphX]Example of Graph#aggregateMessages has error (was: Example of Graph#aggregateMessages has error) [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages has an error: since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482803#comment-14482803 ] Apache Spark commented on SPARK-6736: - User 'sasakitoa' has created a pull request for this issue: https://github.com/apache/spark/pull/5387 [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages has an error: since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasaki Toru updated SPARK-6736: --- Component/s: Documentation [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: Documentation, GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages has an error: since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6716) Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver'
[ https://issues.apache.org/jira/browse/SPARK-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6716. --- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5372 [https://github.com/apache/spark/pull/5372] Change SparkContext.DRIVER_IDENTIFIER from '<driver>' to 'driver' - Key: SPARK-6716 URL: https://issues.apache.org/jira/browse/SPARK-6716 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Fix For: 1.4.0 Currently, the driver's executorId is set to {{<driver>}}. This choice of ID was present in older Spark versions, but it has started to cause problems now that executorIds are used in more contexts, such as Ganglia metric names or driver thread-dump links in the web UI. The angle brackets must be escaped when embedding this ID in XML or as part of URLs, and this has led to multiple problems: - https://issues.apache.org/jira/browse/SPARK-6484 - https://issues.apache.org/jira/browse/SPARK-4313 The simplest solution seems to be to change this ID to something that does not contain any special characters, such as {{driver}}. I'm not sure whether we can perform this change in a patch release, since this ID may be considered a stable API by metrics users, but it's probably okay to do this in a major release as long as we document it in the release notes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482685#comment-14482685 ] Apache Spark commented on SPARK-6691: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/5385 Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, and especially so for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. Because of Spark Streaming's job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming's receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed for the lifetime of the application, which means you need to consider the worst-case scenario to pick a reasonable value. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. 
The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. * Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Use a slow-start rate to control network congestion when a job is started. * Provide a pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and a working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
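The feedback loop proposed here (adjust the allowed ingestion rate toward the processing rate observed for previous batches) can be sketched in a few lines of plain Scala. This is an illustrative sketch only, not Spark's actual implementation: the class name DynamicRateLimiter, the smoothing factor, and the method names are all hypothetical.

```scala
// Hypothetical sketch of a rate limiter whose maximum rate adjusts at runtime.
// After each finished batch, the allowed rate is nudged a fraction of the way
// toward the observed processing rate (a simple exponential-smoothing update).
class DynamicRateLimiter(initialRatePerSec: Double) {
  @volatile private var ratePerSec: Double = initialRatePerSec

  def currentRate: Double = ratePerSec

  // Feedback step using the last batch's measured throughput.
  def updateRate(processedRecords: Long, processingTimeMs: Long,
                 smoothing: Double = 0.5): Unit = {
    val observedRate = processedRecords * 1000.0 / processingTimeMs
    ratePerSec += smoothing * (observedRate - ratePerSec)
  }

  // How many records a receiver may accept in the next interval.
  def allowedInInterval(intervalMs: Long): Long =
    (ratePerSec * intervalMs / 1000.0).toLong
}

// Example: start at 1000 rec/s; a batch processes 500 records in 1000 ms
// (500 rec/s observed), so the rate moves halfway down, to 750 rec/s.
val limiter = new DynamicRateLimiter(1000.0)
limiter.updateRate(processedRecords = 500, processingTimeMs = 1000)
```

A real implementation would also need the slow-start and pluggability aspects described in the design doc; this only shows the dynamic-adjustment idea.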
[jira] [Resolved] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6636. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5302 [https://github.com/apache/spark/pull/5302] Use public DNS hostname everywhere in spark_ec2.py -- Key: SPARK-6636 URL: https://issues.apache.org/jira/browse/SPARK-6636 Project: Spark Issue Type: Bug Components: EC2 Reporter: Matt Aasted Priority: Minor Fix For: 1.3.2, 1.4.0 The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of the instances. This breaks the script for users who are deploying the cluster with a private-network-only security group. The fix is to use public_dns_name in the remaining place. I am submitting a pull-request alongside this bug report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6636) Use public DNS hostname everywhere in spark_ec2.py
[ https://issues.apache.org/jira/browse/SPARK-6636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6636: -- Assignee: Matt Aasted Use public DNS hostname everywhere in spark_ec2.py -- Key: SPARK-6636 URL: https://issues.apache.org/jira/browse/SPARK-6636 Project: Spark Issue Type: Bug Components: EC2 Reporter: Matt Aasted Assignee: Matt Aasted Priority: Minor Fix For: 1.3.2, 1.4.0 The spark_ec2.py script uses public_dns_name everywhere in the script except for testing ssh availability, which is done using the public ip address of the instances. This breaks the script for users who are deploying the cluster with a private-network-only security group. The fix is to use public_dns_name in the remaining place. I am submitting a pull-request alongside this bug report. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482711#comment-14482711 ] Yu Ishikawa commented on SPARK-6682: I think so. In conclusion, I agree with [~josephkb]'s proposal. I've just announced this issue to the developers list, so someone may join the discussion. However, I think we should reach a conclusion, using [~mengxr]'s opinion as a reference. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. 
CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
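The builder pattern under discussion can be sketched in self-contained Scala. This is a minimal illustration with a hypothetical class name (NaiveBayesSketch); MLlib's real NaiveBayes has more parameters and a train(data) method taking an RDD, which is elided here. The point is that setters return `this` so calls chain, and an enforced default means a user can train without setting anything.

```scala
// Hypothetical sketch of the builder pattern proposed in SPARK-6682.
class NaiveBayesSketch {
  private var lambda: Double = 1.0 // smoothing parameter, with an enforced default

  // Setter returns this.type so configuration calls can be chained.
  def setLambda(value: Double): this.type = {
    lambda = value
    this
  }

  def getLambda: Double = lambda

  // def train(data: RDD[LabeledPoint]): NaiveBayesModel would go here;
  // elided because it needs a SparkContext.
}

// Chained configuration, mirroring the JIRA's example:
val nb = new NaiveBayesSketch().setLambda(0.1)
```

Compare this with the old style, where each optional parameter forced another overloaded static train() for Java callers.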
[jira] [Assigned] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6691: --- Assignee: (was: Apache Spark) Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Flow control (or rate control) for input data is very important in a streaming system, and especially so for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. Because of Spark Streaming's job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming's receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed for the lifetime of the application, which means you need to consider the worst-case scenario to pick a reasonable value. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. 
* Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Use a slow-start rate to control network congestion when a job is started. * Provide a pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and a working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6691) Abstract and add a dynamic RateLimiter for Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-6691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6691: --- Assignee: Apache Spark Abstract and add a dynamic RateLimiter for Spark Streaming -- Key: SPARK-6691 URL: https://issues.apache.org/jira/browse/SPARK-6691 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Assignee: Apache Spark Flow control (or rate control) for input data is very important in a streaming system, and especially so for Spark Streaming to stay stable and up-to-date. An unexpected flood of incoming data, or an ingestion rate beyond the computation power of the cluster, will make the system unstable and increase the delay. Because of Spark Streaming's job generation and processing pattern, this delay accumulates and introduces unacceptable exceptions. Currently, in Spark Streaming's receiver-based input stream, there is a RateLimiter in BlockGenerator which controls the ingestion rate of input data, but the current implementation has several limitations: # The max ingestion rate is set by the user through configuration beforehand; the user may lack the experience to set an appropriate value before the application is running. # This configuration is fixed for the lifetime of the application, which means you need to consider the worst-case scenario to pick a reasonable value. # Input streams like DirectKafkaInputStream need to maintain another solution to achieve the same functionality. # The lack of slow-start control makes the whole system easily trapped in large processing and scheduling delays at the very beginning. So here we propose a new dynamic RateLimiter, as well as a new interface for the RateLimiter, to improve the whole system's stability. The targets are: * Dynamically adjust the ingestion rate according to the processing rate of previously finished jobs. 
* Offer a uniform solution not only for receiver-based input streams, but also for direct streams like DirectKafkaInputStream and new ones. * Use a slow-start rate to control network congestion when a job is started. * Provide a pluggable framework to make maintaining extensions easier. Here is the design doc (https://docs.google.com/document/d/1lqJDkOYDh_9hRLQRwqvBXcbLScWPmMa7MlG8J_TE93w/edit?usp=sharing) and a working branch (https://github.com/jerryshao/apache-spark/tree/dynamic-rate-limiter). Any comments would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6736) [GraphX]Example of Graph#aggregateMessages has error
[ https://issues.apache.org/jira/browse/SPARK-6736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-6736. --- Resolution: Fixed Issue resolved by pull request 5388 [https://github.com/apache/spark/pull/5388] [GraphX]Example of Graph#aggregateMessages has error Key: SPARK-6736 URL: https://issues.apache.org/jira/browse/SPARK-6736 Project: Spark Issue Type: Improvement Components: Documentation, GraphX Affects Versions: 1.3.0 Reporter: Sasaki Toru Priority: Minor Fix For: 1.4.0 The example of Graph#aggregateMessages has an error: since aggregateMessages is a method of Graph, it should be written as rawGraph.aggregateMessages -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6739: --- Assignee: Apache Spark Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types Key: SPARK-6739 URL: https://issues.apache.org/jira/browse/SPARK-6739 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Assignee: Apache Spark Priority: Trivial Missing import in the example script under the section Programmatically Specifying the Schema: scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) <console>:25: error: not found: value StructType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482944#comment-14482944 ] Apache Spark commented on SPARK-6739: - User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/5389 Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types Key: SPARK-6739 URL: https://issues.apache.org/jira/browse/SPARK-6739 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Missing import in the example script under the section Programmatically Specifying the Schema: scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) <console>:25: error: not found: value StructType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6739: --- Assignee: (was: Apache Spark) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types Key: SPARK-6739 URL: https://issues.apache.org/jira/browse/SPARK-6739 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Missing import in the example script under the section Programmatically Specifying the Schema: scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) <console>:25: error: not found: value StructType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482917#comment-14482917 ] Yu Ishikawa commented on SPARK-6682: Thanks for replying, [~mengxr]. How about dividing this issue into a few sub-issues, since the area of influence is fairly broad? I think we have a few tasks in it. - Add `@deprecated` tags in Scala/Java - Modify the examples - Modify the documentation - (Share this idea with the community so that no more static train() methods are added) Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. 
* Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
Tijo Thomas created SPARK-6739: -- Summary: Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types Key: SPARK-6739 URL: https://issues.apache.org/jira/browse/SPARK-6739 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Missing import in the example script under the section Programmatically Specifying the Schema: scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) <console>:25: error: not found: value StructType -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
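The fix the ticket's title names is the one-line import below. A hedged sketch of the documentation example with the import in place; the value of schemaString ("name age") is assumed here for illustration, and a SQLContext would still be needed to apply the schema to an RDD:

```scala
// Bring StructType, StructField, StringType into scope (Spark 1.3+ package).
import org.apache.spark.sql.types._

// Build a schema programmatically from a space-separated field list.
val schemaString = "name age"
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
```

Without the import, the REPL fails with `error: not found: value StructType`, exactly as reported.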
[jira] [Created] (SPARK-6738) EstimateSize is difference with spill file size
Hong Shen created SPARK-6738: Summary: EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 1100M data to disk: 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) But the file size is only 1.1M. 
[tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b The estimated size is hugely different from the spill file size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482884#comment-14482884 ] Xiangrui Meng commented on SPARK-6682: -- +1 on deprecating the static train methods and not removing them for binary compatibility in Java/Scala. Python API could remain the same. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. 
CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6721) IllegalStateException when connecting to MongoDB using spark-submit
[ https://issues.apache.org/jira/browse/SPARK-6721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482950#comment-14482950 ] Sean Owen commented on SPARK-6721: -- From the stack trace, the error is coming from the Mongo driver. That is mild evidence that something is wrong with / misconfigured within the driver. It looks like it is asserting that a connection is open when it is not. Maybe you are reusing a closed connection or using it before it's configured / open? I usually try to find the source code to see what is going on, like: http://grepcode.com/file/repo1.maven.org/maven2/org.mongodb/mongo-java-driver/2.13.0/com/mongodb/DBTCPConnector.java/ I also find it's deprecated in the latest code https://github.com/mongodb/mongo-java-driver/blob/77b7974d8be49c45dcba01c32a8458b121092f87/config/clirr-exclude.yml Maybe review your code's usage of the Mongo driver in light of the stack trace and source? If you see a reasonable theory about how Spark is not quite calling the Hadoop output format correctly (though here, it's just the generic Hadoop output path, which is used heavily and therefore should be OK), post that here. Otherwise yeah I suspect you have an issue in your code. 
IllegalStateException when connecting to MongoDB using spark-submit
---
Key: SPARK-6721
URL: https://issues.apache.org/jira/browse/SPARK-6721
Project: Spark
Issue Type: Bug
Components: Java API
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Environment: Ubuntu 14.04, Java 8, MongoDB 3.0, Spark 1.3
Reporter: Luis Rodríguez Trejo
Labels: MongoDB, java.lang.IllegalStateexception, saveAsNewAPIHadoopFile

I get the following exception when using saveAsNewAPIHadoopFile:
{code}
15/03/23 17:05:34 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4, 10.0.2.15): java.lang.IllegalStateException: open
	at org.bson.util.Assertions.isTrue(Assertions.java:36)
	at com.mongodb.DBTCPConnector.getPrimaryPort(DBTCPConnector.java:406)
	at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:184)
	at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
	at com.mongodb.DBCollection.insert(DBCollection.java:161)
	at com.mongodb.DBCollection.insert(DBCollection.java:107)
	at com.mongodb.DBCollection.save(DBCollection.java:1049)
	at com.mongodb.DBCollection.save(DBCollection.java:1014)
	at com.mongodb.hadoop.output.MongoRecordWriter.write(MongoRecordWriter.java:105)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1000)
	at org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
Before Spark 1.3.0 this would result in the application crashing, but now the data just remains unprocessed. The code does not call close() at any point.
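The failure mode Sean describes — the driver asserting a connection is open when it is not — matches the Assertions.isTrue frame at the top of the stack trace. A minimal Scala sketch of that defensive pattern (the connector class here is a made-up stand-in, not the real mongo-java-driver API):

```scala
// Sketch of the driver's "assert open" guard; names are illustrative.
object Assertions {
  // Throws IllegalStateException with the assertion's name as message,
  // which is why the log shows just "java.lang.IllegalStateException: open".
  def isTrue(name: String, condition: Boolean): Unit =
    if (!condition) throw new IllegalStateException(name)
}

class FakeConnector {
  @volatile private var open = true
  def close(): Unit = { open = false }
  def insert(doc: String): Unit = {
    Assertions.isTrue("open", open)  // guard at the top of every operation
    // ... write doc ...
  }
}

object OpenGuardDemo extends App {
  val c = new FakeConnector()
  c.insert("ok")   // fine while the connection is open
  c.close()
  try c.insert("fails after close")
  catch { case e: IllegalStateException => println(s"caught: ${e.getMessage}") }
}
```

If a task retry reuses a connector that an earlier attempt already closed, every subsequent write hits this guard, which would explain data silently remaining unprocessed under Spark 1.3's retry behavior.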
[jira] [Resolved] (SPARK-6732) Scala existentials warning during compilation
[ https://issues.apache.org/jira/browse/SPARK-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6732.
--
Resolution: Duplicate

Scala existentials warning during compilation
-
Key: SPARK-6732
URL: https://issues.apache.org/jira/browse/SPARK-6732
Project: Spark
Issue Type: Improvement
Components: Scheduler
Environment: operating system: OSX Yosemite scala version: 2.10.4 hardware: 2.7 GHz Intel Core i7, 16 GB 1600 MHz DDR3
Reporter: Raymond Tay
Priority: Minor

Certain parts of the Scala code were detected to use existentials; the Scala import can be included in the source files to prevent such warnings.
[jira] [Created] (SPARK-6740) SQL operator and condition precedence is not honoured
Santiago M. Mola created SPARK-6740:
---
Summary: SQL operator and condition precedence is not honoured
Key: SPARK-6740
URL: https://issues.apache.org/jira/browse/SPARK-6740
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Santiago M. Mola

The following query from the SQL Logic Test suite fails to parse:
{code}
SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT ( - _2 + - 39 ) IS NULL
{code}
while the following (equivalent) query does parse correctly:
{code}
SELECT DISTINCT * FROM t1 AS cor0 WHERE NOT (( - _2 + - 39 ) IS NULL)
{code}
SQLite, MySQL and Oracle (and probably most SQL implementations) define IS with higher precedence than NOT, so the first query is valid and well-defined.
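One way to see the precedence point is a toy recursive-descent predicate parser (purely illustrative, not Spark SQL's actual parser) in which IS NULL binds tighter than NOT, so the unparenthesized query parses:

```scala
// Toy grammar, for illustration only:
//   pred := "NOT" pred | operand [ "IS" "NULL" ]
// NOT is handled at the outermost level, so it binds more loosely
// than IS NULL — matching the precedence SQLite/MySQL/Oracle use.
sealed trait Pred
case class Atom(operand: String) extends Pred
case class IsNull(operand: String) extends Pred
case class Not(p: Pred) extends Pred

object PrecedenceDemo extends App {
  def parse(tokens: List[String]): Pred = tokens match {
    case "NOT" :: rest                    => Not(parse(rest))
    case operand :: "IS" :: "NULL" :: Nil => IsNull(operand)
    case operand :: Nil                   => Atom(operand)
    case _                                => sys.error(s"cannot parse: $tokens")
  }

  // "NOT ( - _2 + - 39 ) IS NULL", treating the parenthesized
  // arithmetic as a single operand token:
  val parsed = parse(List("NOT", "(- _2 + - 39)", "IS", "NULL"))
  println(parsed)  // Not(IsNull((- _2 + - 39)))
}
```

A parser where NOT instead consumed only the next parenthesized group would see a dangling `IS NULL` and fail, which is consistent with the reported parse error.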
[jira] [Resolved] (SPARK-6695) Add an external iterator: a hadoop-like output collector
[ https://issues.apache.org/jira/browse/SPARK-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6695.
--
Resolution: Won't Fix

I suppose my problem with that is that it would duplicate Spark's spill mechanism, and it leaves open the questions I raised above about cleanup. Spark functions aren't supposed to need a huge amount of memory all at once, so I imagine the solution in every case is to redesign the method.

Add an external iterator: a hadoop-like output collector
-
Key: SPARK-6695
URL: https://issues.apache.org/jira/browse/SPARK-6695
Project: Spark
Issue Type: New Feature
Components: Spark Core
Reporter: uncleGen

In practical use, we often need to create a big iterator, one that is too big in memory usage or too long in array size. On the one hand, it leads to too much memory consumption. On the other hand, a single Array may not be able to hold all the elements, as Java array indices are of type int (4 bytes, 32 bits). So, IMHO, we could provide a collector which has a buffer (100MB or some other size) and can spill data to disk. The use case may look like:
{code:borderStyle=solid}
rdd.mapPartitions { it =>
  ...
  val collector = new ExternalCollector()
  collector.collect(a)
  ...
  collector.iterator
}
{code}
I have done some related work, and I need your opinions, thanks!
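For concreteness, here is a minimal, Spark-independent sketch of what such a spilling collector might look like. `ExternalCollector` here is hypothetical (the JIRA's proposed name, not an existing API): it buffers records in memory and spills the buffer to a temp file past a threshold.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Hypothetical spilling collector: buffers records in memory and spills
// the buffer to a temp file once it exceeds `spillThreshold` elements.
class ExternalCollector[T](spillThreshold: Int = 1000) {
  private val buffer = new ArrayBuffer[T]()
  private val spills = new ArrayBuffer[File]()

  def collect(elem: T): Unit = {
    buffer += elem
    if (buffer.size >= spillThreshold) spill()
  }

  private def spill(): Unit = {
    val f = File.createTempFile("external-collector", ".spill")
    f.deleteOnExit()  // crude answer to the cleanup question Sean raises
    val out = new PrintWriter(f)
    try buffer.foreach(e => out.println(e)) finally out.close()
    spills += f
    buffer.clear()
  }

  // Replays spilled records (as strings) first, then in-memory ones.
  def iterator: Iterator[String] =
    spills.iterator.flatMap(f => Source.fromFile(f).getLines()) ++
      buffer.iterator.map(_.toString)
}

object CollectorDemo extends App {
  val c = new ExternalCollector[Int](spillThreshold = 3)
  (1 to 7).foreach(c.collect)  // spills twice, keeps 7 in memory
  println(c.iterator.toList)   // elements 1..7 round-trip (as strings)
}
```

Note what a real implementation would additionally need — a binary serializer instead of line-oriented text, and deterministic deletion of spill files when the iterator is exhausted or the task dies — which is exactly the duplication of Spark's internal spill machinery that motivated the Won't Fix.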
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482987#comment-14482987 ] Sean Owen commented on SPARK-6738:
--
Is that the only file spilled, though? I'm not an expert, but it looks like lots of files were spilled here.

EstimateSize is difference with spill file size
Key: SPARK-6738
URL: https://issues.apache.org/jira/browse/SPARK-6738
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

ExternalAppendOnlyMap spilled 1100M of data to disk:
{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far)
{code}
But the file size is only 1.1M.
{code}
[tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
-rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
{code}
The estimateSize is hugely different from the spill file size.
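A large gap here is plausible even without a bug: estimateSize reports sampled in-memory JVM object size, while spill files hold serialized (and in Spark typically compressed) records. A rough Scala sketch of how far apart the two measures can be, using java.io serialization as a stand-in for Spark's serializer and a back-of-envelope object-overhead estimate (the per-object constants below are illustrative assumptions, not real measurements):

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// In-memory JVM footprint vs. serialized size for the same records.
object SizeGap extends App {
  val records: Array[String] = Array.tabulate(10000)(i => s"key-$i")

  // Naive in-memory estimate: assumed ~40B of String + char[] object
  // overhead plus 2 bytes per char (illustrative arithmetic only).
  val inMemoryEstimate: Long = records.map(s => 40L + 2L * s.length).sum

  // Serialized size via plain Java serialization.
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(records)
  oos.close()
  val serialized: Long = bos.size.toLong

  // Serialized bytes come out far smaller than the in-memory estimate.
  println(s"estimated in-memory: $inMemoryEstimate B, serialized: $serialized B")
}
```

So an ~1100 MB in-memory map producing ~1.1 MB spill files is an extreme ratio worth investigating, but some gap between the two numbers is expected by design.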
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) But the file size is only 1.1M. 
[tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b The estimateSize is hugh difference with spill file size (Please use the code tag to format output) EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO
[jira] [Updated] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6733:
-
Priority: Trivial (was: Major)

What file are you talking about?

Suppression of usage of Scala existential code should be done
-
Key: SPARK-6733
URL: https://issues.apache.org/jira/browse/SPARK-6733
Project: Spark
Issue Type: Improvement
Components: Scheduler
Affects Versions: 1.3.0
Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM
Reporter: Raymond Tay
Priority: Trivial

The inclusion of this statement in the file
{code:scala}
import scala.language.existentials
{code}
should have suppressed all warnings regarding the use of Scala existential code.
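For context, a minimal sketch (assuming scalac is invoked with -feature) of code whose inferred type is existential, together with the import that suppresses the warning. Note the import only silences the warning in the file that contains it, which may be why warnings persist elsewhere:

```scala
import scala.language.existentials  // suppresses the -feature existential warning in this file

object ExistentialDemo extends App {
  // The inferred result type is an existential (roughly Class[_ >: Int with String]);
  // without the import above, compiling with -feature may warn about it.
  def pick(b: Boolean) = if (b) classOf[Int] else classOf[String]

  println(pick(true).getName)  // int
}
```

The import is per compilation unit; an alternative is enabling the feature project-wide with the -language:existentials compiler flag.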
[jira] [Commented] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483034#comment-14483034 ] Tijo Thomas commented on SPARK-6739:
--
Please close this duplicate issue.

Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
Key: SPARK-6739
URL: https://issues.apache.org/jira/browse/SPARK-6739
Project: Spark
Issue Type: Bug
Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

Missing import in the example script under the section Programmatically Specifying the Schema:
{code}
scala> val schema =
     |   StructType(
     |     schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
<console>:25: error: not found: value StructType
         StructType(
         ^
{code}
[jira] [Comment Edited] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483034#comment-14483034 ] Tijo Thomas edited comment on SPARK-6739 at 4/7/15 11:26 AM:
-
Please close this duplicate issue. However, the previous fix is not reflected in the documentation under the section Programmatically Specifying the Schema: https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options

was (Author: tijo paracka): Please close this duplicate issue. However, the previous fix is not reflected in the documentation.

Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
Key: SPARK-6739
URL: https://issues.apache.org/jira/browse/SPARK-6739
Project: Spark
Issue Type: Bug
Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

Missing import in the example script under the section Programmatically Specifying the Schema:
{code}
scala> val schema =
     |   StructType(
     |     schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
<console>:25: error: not found: value StructType
         StructType(
         ^
{code}
[jira] [Resolved] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6739.
--
Resolution: Not A Problem

It's because the site hasn't been republished with a new release since the fix.

Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
Key: SPARK-6739
URL: https://issues.apache.org/jira/browse/SPARK-6739
Project: Spark
Issue Type: Bug
Components: Documentation, SQL
Affects Versions: 1.3.0
Reporter: Tijo Thomas
Priority: Trivial

Missing import in the example script under the section Programmatically Specifying the Schema:
{code}
scala> val schema =
     |   StructType(
     |     schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
<console>:25: error: not found: value StructType
         StructType(
         ^
{code}
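For reference, the fix named in the issue title is simply the missing import. This fragment assumes a Spark 1.3 shell with `schemaString` already defined, so it is not runnable standalone:

```scala
// Brings StructType, StructField, StringType into scope:
import org.apache.spark.sql.types._

val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
```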
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here is the other spilled file. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark 
users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483011#comment-14483011 ] Hong Shen commented on SPARK-6738:
--
Yes, it spills lots of files, but each one is only 1.1M.

EstimateSize is difference with spill file size
Key: SPARK-6738
URL: https://issues.apache.org/jira/browse/SPARK-6738
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.2.0
Reporter: Hong Shen

ExternalAppendOnlyMap spilled 1100M of data to disk:
{code}
15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9
15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b
15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far)
{code}
But the file size is only 1.1M.
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here is the other spilled file. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here are the other spilled files. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark 
users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spills 1100M of data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here are the other spilled files. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark 
users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugely different from the spill file size. was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.
[ https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483018#comment-14483018 ] Twinkle Sachdeva commented on SPARK-6735: - Created a PR here: https://github.com/twinkle-sachdeva/spark/pull/1 Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it. --- Key: SPARK-6735 URL: https://issues.apache.org/jira/browse/SPARK-6735 Project: Spark Issue Type: Improvement Components: Spark Submit, YARN Affects Versions: 1.2.0, 1.2.1, 1.3.0 Reporter: Twinkle Sachdeva Currently there is a setting (spark.yarn.max.executor.failures) which specifies the maximum number of executor failures, after which the application fails. For long-running applications, a user may want the application never to be killed, or may want this limit evaluated relative to a window duration. This improvement is to provide options to make the maximum executor failure count (which kills the application) relative to a window duration, or to disable it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
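The windowed failure counting requested above can be sketched as follows. This is an illustrative sketch only, not the linked PR's actual implementation; the class and method names are hypothetical, and the window and threshold values stand in for configuration analogous to spark.yarn.max.executor.failures:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: count executor failures only within a trailing time
// window, instead of over the whole application lifetime. Names are
// illustrative and do not match Spark's YARN ApplicationMaster code.
public class WindowedFailureTracker {
    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private final long windowMillis; // assumed window-duration config
    private final int maxFailures;   // analogous to spark.yarn.max.executor.failures

    public WindowedFailureTracker(long windowMillis, int maxFailures) {
        this.windowMillis = windowMillis;
        this.maxFailures = maxFailures;
    }

    /** Record one executor failure; return true if the app should be killed. */
    public boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Evict failures that have fallen out of the trailing window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() > maxFailures;
    }

    public static void main(String[] args) {
        WindowedFailureTracker t = new WindowedFailureTracker(10_000, 2);
        System.out.println(t.recordFailure(0));      // 1 failure in window
        System.out.println(t.recordFailure(1_000));  // 2 failures in window
        System.out.println(t.recordFailure(2_000));  // 3 > 2: kill
        System.out.println(t.recordFailure(20_000)); // old failures evicted
    }
}
```

A very large windowMillis recovers today's lifetime-wide counting, and a very large maxFailures effectively disables the kill behavior, covering both options the ticket asks for.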
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483025#comment-14483025 ] Sean Owen commented on SPARK-6738: -- Do you observe a problem? Is it possible that you are looking at unserialized objects in memory but a serialized representation on disk? What is the nature of the data? More info would be much more helpful. EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 1100M of data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here are the other spilled files. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f 
/data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugely different from the spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail:
[jira] [Comment Edited] (SPARK-6739) Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types
[ https://issues.apache.org/jira/browse/SPARK-6739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483034#comment-14483034 ] Tijo Thomas edited comment on SPARK-6739 at 4/7/15 11:25 AM: - Please close this duplicate issue. However, the previous fix is not reflected in the documentation. was (Author: tijo paracka): Please close this duplicate issue Spark SQL Example gives errors due to missing import of Types org.apache.spark.sql.types Key: SPARK-6739 URL: https://issues.apache.org/jira/browse/SPARK-6739 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial Missing import in the example script under the section Programmatically Specifying the Schema: {code} scala> val schema = | StructType( | schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true))) <console>:25: error: not found: value StructType StructType( ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483036#comment-14483036 ] Raymond Tay commented on SPARK-6733: Apologize. It's DAGScheduler.scala Suppression of usage of Scala existential code should be done - Key: SPARK-6733 URL: https://issues.apache.org/jira/browse/SPARK-6733 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.3.0 Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM Reporter: Raymond Tay Priority: Trivial The inclusion of this statement in the file {code:scala} import scala.language.existentials {code} should have suppressed all warnings regarding the use of scala existential code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6420) Driver's Block Manager does not use spark.driver.host in Yarn-Client mode
[ https://issues.apache.org/jira/browse/SPARK-6420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6420. -- Resolution: Duplicate Driver's Block Manager does not use spark.driver.host in Yarn-Client mode --- Key: SPARK-6420 URL: https://issues.apache.org/jira/browse/SPARK-6420 Project: Spark Issue Type: Bug Components: Block Manager Reporter: Liangliang Gu In my cluster, the yarn node does not know the client's host name. So I set spark.driver.host to the ip address of the client. But the driver's Block Manager does not use spark.driver.host but the hostname in Yarn-Client mode. I got the following error: TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2, hadoop-node1538098): java.io.IOException: Failed to connect to example-hostname at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:191) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:78) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.nio.channels.UnresolvedAddressException at sun.nio.ch.Net.checkAddress(Net.java:127) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644) at 
io.netty.channel.socket.nio.NioSocketChannel.doConnect(NioSocketChannel.java:193) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.connect(AbstractNioChannel.java:200) at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1029) at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496) at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481) at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47) at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:496) at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:481) at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:463) at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:849) at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:199) at io.netty.bootstrap.Bootstrap$2.run(Bootstrap.java:165) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ... 1 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6741) Add support for SELECT ALL syntax
Santiago M. Mola created SPARK-6741: --- Summary: Add support for SELECT ALL syntax Key: SPARK-6741 URL: https://issues.apache.org/jira/browse/SPARK-6741 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Minor Support SELECT ALL syntax (equivalent to SELECT, without DISTINCT). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
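Since SELECT ALL is defined here as equivalent to a bare SELECT, the change is purely syntactic. A minimal illustrative sketch follows; a real fix would extend Spark's SQL grammar rather than pre-rewriting the query string, and the class name is hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: normalize the redundant ALL keyword away before a
// parser that only understands plain SELECT / SELECT DISTINCT runs. This is
// not how Spark's parser is structured; it only illustrates the semantics.
public class SelectAllNormalizer {
    // Note: a naive regex like this would also match a column literally
    // named "all"; a grammar-level fix avoids that problem.
    private static final Pattern SELECT_ALL =
        Pattern.compile("(?i)\\bSELECT\\s+ALL\\b");

    public static String normalize(String sql) {
        return SELECT_ALL.matcher(sql).replaceAll("SELECT");
    }

    public static void main(String[] args) {
        System.out.println(normalize("SELECT ALL name FROM people"));
    }
}
```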
[jira] [Updated] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-6742: -- This is the same as SPARK-6554, but for the new parquet path. Spark pushes down filters in old parquet path that reference partitioning columns - Key: SPARK-6742 URL: https://issues.apache.org/jira/browse/SPARK-6742 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Yash Datta Create a table with multiple fields, partitioned on the 'market' column. Run a query like: SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND start_sp_time = 1.4158368E9 AND end_sp_time 1.4159232E9 AND dt = '2014-11-13-00-00' AND dt '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 100 The OR filter is pushed down in this case, resulting in a column-not-found exception from Parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns
Yash Datta created SPARK-6742: - Summary: Spark pushes down filters in old parquet path that reference partitioning columns Key: SPARK-6742 URL: https://issues.apache.org/jira/browse/SPARK-6742 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Yash Datta Create a table with multiple fields partitioned on 'market' column. run a query like : SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND start_sp_time = 1.4158368E9 AND end_sp_time 1.4159232E9 AND dt = '2014-11-13-00-00' AND dt '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 100 The or filter is pushed down in this case , resulting in column not found exception from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6742: --- Assignee: Apache Spark Spark pushes down filters in old parquet path that reference partitioning columns - Key: SPARK-6742 URL: https://issues.apache.org/jira/browse/SPARK-6742 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Yash Datta Assignee: Apache Spark Create a table with multiple fields partitioned on 'market' column. run a query like : SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND start_sp_time = 1.4158368E9 AND end_sp_time 1.4159232E9 AND dt = '2014-11-13-00-00' AND dt '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 100 The or filter is pushed down in this case , resulting in column not found exception from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6742: --- Assignee: (was: Apache Spark) Spark pushes down filters in old parquet path that reference partitioning columns - Key: SPARK-6742 URL: https://issues.apache.org/jira/browse/SPARK-6742 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Yash Datta Create a table with multiple fields partitioned on 'market' column. run a query like : SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND start_sp_time = 1.4158368E9 AND end_sp_time 1.4159232E9 AND dt = '2014-11-13-00-00' AND dt '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 100 The or filter is pushed down in this case , resulting in column not found exception from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483095#comment-14483095 ] Apache Spark commented on SPARK-6742: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/5390 Spark pushes down filters in old parquet path that reference partitioning columns - Key: SPARK-6742 URL: https://issues.apache.org/jira/browse/SPARK-6742 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Yash Datta Create a table with multiple fields partitioned on 'market' column. run a query like : SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid FROM csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND start_sp_time = 1.4158368E9 AND end_sp_time 1.4159232E9 AND dt = '2014-11-13-00-00' AND dt '2014-11-14-00-00' ORDER BY end_sp_time DESC LIMIT 100 The or filter is pushed down in this case , resulting in column not found exception from parquet -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
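The bug above boils down to handing Parquet a predicate that mentions a column which exists only in the partition directory layout, not inside the Parquet files. A hedged sketch of the fix direction, using a hypothetical Predicate type rather than Spark's Catalyst expressions, is to split the filters before pushdown:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: before pushing filters down to Parquet, drop any
// predicate that references a partitioning column, since those columns do
// not exist in the Parquet files themselves. Types and names are hypothetical.
public class PartitionFilterSplit {
    static final class Predicate {
        final String sql;
        final Set<String> referencedColumns;
        Predicate(String sql, String... cols) {
            this.sql = sql;
            this.referencedColumns = new HashSet<>(Arrays.asList(cols));
        }
    }

    /** Keep only predicates that touch no partition column. */
    static List<Predicate> pushable(List<Predicate> filters, Set<String> partitionCols) {
        List<Predicate> out = new ArrayList<>();
        for (Predicate p : filters) {
            if (Collections.disjoint(p.referencedColumns, partitionCols)) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> partitionCols = new HashSet<>(Arrays.asList("market", "dt"));
        List<Predicate> filters = Arrays.asList(
            new Predicate("technology = 'FDD'", "technology"),
            // The OR branch from the report mixes data and partition columns,
            // so it must be evaluated above the Parquet reader, not inside it.
            new Predicate("(bandclass = '800' AND market = 'LA metro') OR ...",
                          "bandclass", "market"));
        for (Predicate p : pushable(filters, partitionCols)) {
            System.out.println(p.sql);
        }
    }
}
```

The dropped predicates still have to be applied by the engine after partition pruning; only the purely data-column filters are safe to hand to the Parquet reader.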
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory used is less than 1 GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugely different from the spill file size. was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) 
/data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here are the other spilled files. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483103#comment-14483103 ] Sean Owen commented on SPARK-6738: -- To be clear, I am asking how big the data being spilled is in memory; the GC state isn't relevant. That is, are they just compressing 10x on serialization into the files you see? It is not crazy. EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory used is less than 1 GB. 
{code} 2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugely different from the spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483104#comment-14483104 ] Hong Shen commented on SPARK-6738: -- I don't think serialization is causing the problem. The input data is a Hive table, and the Spark job is a Spark SQL query. In fact, when the log shows an in-memory map of 2.2 GB spilling to disk, the file is only 2.2M, and the GC log shows the JVM is using less than 1 GB. The estimateSize also deviates from the JVM memory. EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spills 2.2 GB of data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log shows that the JVM memory used is less than 1 GB. 
{code}
2015-04-07T20:27:08.023+0800: [GC 981981K->55363K(3961344K), 0.0341720 secs]
2015-04-07T20:27:14.483+0800: [GC 987523K->53737K(3961344K), 0.0252660 secs]
2015-04-07T20:27:20.793+0800: [GC 985897K->56370K(3961344K), 0.0606460 secs]
2015-04-07T20:27:27.553+0800: [GC 988530K->59089K(3961344K), 0.0651840 secs]
2015-04-07T20:27:34.067+0800: [GC 991249K->62153K(3961344K), 0.0288460 secs]
2015-04-07T20:27:40.180+0800: [GC 994313K->61344K(3961344K), 0.0388970 secs]
2015-04-07T20:27:46.490+0800: [GC 993504K->59915K(3961344K), 0.0235150 secs]
{code}
The estimateSize is hugely different from the spill file size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
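For context on the discussion above: an in-memory size estimate and a serialized (spilled) size measure different things and can legitimately diverge. The sketch below is illustrative only; the 48-bytes-per-entry figure is an assumption about boxed-HashMap overhead, not what Spark's SizeEstimator computes, and the point is only that the two numbers need not agree.

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class SizeGapDemo {

    // Rough in-memory estimate: assume ~48 bytes per HashMap entry
    // (entry object plus boxed key and value). This figure is an assumption
    // for illustration, not a measurement.
    public static long roughInMemoryEstimate(int entries) {
        return entries * 48L;
    }

    // Actual size of the same map when Java-serialized, as it would be on disk.
    public static int serializedSize(int entries) throws Exception {
        HashMap<Integer, Integer> m = new HashMap<>();
        for (int i = 0; i < entries; i++) {
            m.put(i, i);
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(m);
        }
        return bos.size();
    }

    public static void main(String[] args) throws Exception {
        int n = 10_000;
        System.out.println("rough in-memory estimate: " + roughInMemoryEstimate(n) + " bytes");
        System.out.println("serialized size:          " + serializedSize(n) + " bytes");
    }
}
```

On a typical JVM the serialized form is several times smaller than the in-memory estimate, though nowhere near the 1000x gap reported here, which is why Hong Shen's objection is worth following up.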
[jira] [Commented] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483109#comment-14483109 ] Apache Spark commented on SPARK-6612: - User 'FlytxtRnD' has created a pull request for this issue: https://github.com/apache/spark/pull/5391 Python KMeans parity Key: SPARK-6612 URL: https://issues.apache.org/jira/browse/SPARK-6612 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Hrishikesh Priority: Minor This is a subtask of [SPARK-6258] for the Python API of KMeans. These items are missing: KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
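For reference while reviewing the parity list above: computeCost, one of the missing KMeansModel items, is the sum of squared distances from each point to its nearest center. A plain-Java sketch of that definition (not MLlib's implementation):

```java
import java.util.Arrays;
import java.util.List;

public class KMeansCost {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Sum over all points of the squared distance to the nearest center.
    public static double computeCost(List<double[]> points, List<double[]> centers) {
        double cost = 0.0;
        for (double[] p : points) {
            double best = Double.POSITIVE_INFINITY;
            for (double[] c : centers) {
                best = Math.min(best, squaredDistance(p, c));
            }
            cost += best;
        }
        return cost;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[]{0, 0}, new double[]{1, 0}, new double[]{10, 0});
        List<double[]> centers = Arrays.asList(
                new double[]{0, 0}, new double[]{10, 0});
        // Only (1, 0) is off-center, at squared distance 1 from (0, 0).
        System.out.println(computeCost(points, centers)); // 1.0
    }
}
```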
[jira] [Assigned] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6612: --- Assignee: Hrishikesh (was: Apache Spark) Python KMeans parity Key: SPARK-6612 URL: https://issues.apache.org/jira/browse/SPARK-6612 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Hrishikesh Priority: Minor This is a subtask of [SPARK-6258] for the Python API of KMeans. These items are missing: KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6612: --- Assignee: Apache Spark (was: Hrishikesh) Python KMeans parity Key: SPARK-6612 URL: https://issues.apache.org/jira/browse/SPARK-6612 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor This is a subtask of [SPARK-6258] for the Python API of KMeans. These items are missing: KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6743) Join with empty projection on one side produces invalid results
[ https://issues.apache.org/jira/browse/SPARK-6743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-6743: Priority: Critical (was: Major) Join with empty projection on one side produces invalid results --- Key: SPARK-6743 URL: https://issues.apache.org/jira/browse/SPARK-6743 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola Priority: Critical
{code:java}
val sqlContext = new SQLContext(sc)
val tab0 = sc.parallelize(Seq(
  (83, 0, 38),
  (26, 0, 79),
  (43, 81, 24)
))
sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
sqlContext.cacheTable("tab0")
val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
val result1 = df1.collect()
val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
val result2 = df2.collect()
val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
val result3 = df3.collect()
{code}
Given the previous code, result2 equals Row(43), Row(83), Row(26), which is wrong: these results correspond to cor0._1 instead of cor0._2. The correct results would be Row(0), Row(81), which is what the third query returns. The first query also produces valid results; the only difference is that the left side of the join is not empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6743) Join with empty projection on one side produces invalid results
Santiago M. Mola created SPARK-6743: --- Summary: Join with empty projection on one side produces invalid results Key: SPARK-6743 URL: https://issues.apache.org/jira/browse/SPARK-6743 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Santiago M. Mola
{code:java}
val sqlContext = new SQLContext(sc)
val tab0 = sc.parallelize(Seq(
  (83, 0, 38),
  (26, 0, 79),
  (43, 81, 24)
))
sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(tab0), "tab0")
sqlContext.cacheTable("tab0")
val df1 = sqlContext.sql("SELECT tab0._2, cor0._2 FROM tab0, tab0 cor0 GROUP BY tab0._2, cor0._2")
val result1 = df1.collect()
val df2 = sqlContext.sql("SELECT cor0._2 FROM tab0, tab0 cor0 GROUP BY cor0._2")
val result2 = df2.collect()
val df3 = sqlContext.sql("SELECT cor0._2 FROM tab0 cor0 GROUP BY cor0._2")
val result3 = df3.collect()
{code}
Given the previous code, result2 equals Row(43), Row(83), Row(26), which is wrong: these results correspond to cor0._1 instead of cor0._2. The correct results would be Row(0), Row(81), which is what the third query returns. The first query also produces valid results; the only difference is that the left side of the join is not empty. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483174#comment-14483174 ] Emre Sevinç edited comment on SPARK-3276 at 4/7/15 2:36 PM: [~srowen] would it be fine if I added a public API method on the FileInputDStream class that takes a single parameter (duration) and sets the value of {{MIN_REMEMBER_DURATION}} to that value? And of course, at the same time, changing MIN_REMEMBER_DURATION from a constant into a variable, with a default value of 1 minute (the currently hard-coded value). Or, as an alternative to achieve a similar effect: create another Spark configuration property (with a default value of 1 minute) and refactor the code so that (the new) {{minRememberDuration}} variable takes its value from that property. Right now, I have no idea which of the above two approaches is more meaningful / idiomatic. Any comments? was (Author: emres): [~srowen] would it be fine if I added a public API method on the FileInputDStream class that takes a single parameter (duration) and sets the value of MIN_REMEMBER_DURATION to that value? And of course, at the same time, changing MIN_REMEMBER_DURATION from a constant into a variable, with a default value of 1 minute (the currently hard-coded value). Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently, there is only one API, textFileStream, in StreamingContext for creating a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API that lets the user choose whether old files should be ignored or not.
The API currently in StreamingContext:
{code}
def textFileStream(directory: String): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
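The second approach Emre Sevinç describes — a configuration property with a 1-minute default — can be sketched in isolation. The property name below is illustrative, not an actual Spark key, and Properties stands in for Spark's configuration:

```java
import java.util.Properties;

public class MinRememberDuration {

    // Default mirrors the currently hard-coded value: 1 minute.
    static final long DEFAULT_MS = 60_000L;

    // Illustrative property name, not an actual Spark configuration key.
    static final String KEY = "spark.streaming.minRememberDuration";

    // Read the duration from configuration, falling back to the default,
    // so existing deployments keep the old behavior unchanged.
    public static long minRememberDurationMs(Properties conf) {
        String v = conf.getProperty(KEY);
        return (v == null) ? DEFAULT_MS : Long.parseLong(v);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(minRememberDurationMs(conf)); // 60000, the default

        conf.setProperty(KEY, "300000"); // user overrides to 5 minutes
        System.out.println(minRememberDurationMs(conf)); // 300000
    }
}
```

The appeal of this variant over a setter on FileInputDStream is that it needs no new public API surface; the trade-off is that the value cannot vary per stream.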
[jira] [Resolved] (SPARK-3591) Provide fire and forget option for YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3591. -- Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Tao Wang Target Version/s: 1.4.0 (was: 1.2.0) Provide fire and forget option for YARN cluster mode -- Key: SPARK-3591 URL: https://issues.apache.org/jira/browse/SPARK-3591 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Tao Wang Fix For: 1.4.0 After launching an application through yarn-cluster mode, the SparkSubmit process sticks around and enters a monitoring loop to track the application's status. This is really a responsibility that belongs to a different process, such that SparkSubmit can run yarn-cluster applications in a fire-and-forget mode. We currently already do this for standalone-cluster mode; we should do it for yarn-cluster mode too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6602) Replace direct use of Akka with Spark RPC interface
[ https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483275#comment-14483275 ] Apache Spark commented on SPARK-6602: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5392 Replace direct use of Akka with Spark RPC interface --- Key: SPARK-6602 URL: https://issues.apache.org/jira/browse/SPARK-6602 Project: Spark Issue Type: Sub-task Components: Spark Core Reporter: Reynold Xin Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3276) Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming
[ https://issues.apache.org/jira/browse/SPARK-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483174#comment-14483174 ] Emre Sevinç commented on SPARK-3276: [~srowen] would it be fine if I added a public API method on the FileInputDStream class that takes a single parameter (duration) and sets the value of MIN_REMEMBER_DURATION to that value? And of course, at the same time, changing MIN_REMEMBER_DURATION from a constant into a variable, with a default value of 1 minute (the currently hard-coded value). Provide a API to specify MIN_REMEMBER_DURATION for files to consider as input in streaming -- Key: SPARK-3276 URL: https://issues.apache.org/jira/browse/SPARK-3276 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Jack Hu Priority: Minor Currently, there is only one API, textFileStream, in StreamingContext for creating a text file DStream, and it always ignores old files. Sometimes the old files are still useful. We need an API that lets the user choose whether old files should be ignored or not. The API currently in StreamingContext:
{code}
def textFileStream(directory: String): DStream[String] = {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6744) Add support for CROSS JOIN syntax
Santiago M. Mola created SPARK-6744: --- Summary: Add support for CROSS JOIN syntax Key: SPARK-6744 URL: https://issues.apache.org/jira/browse/SPARK-6744 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Environment: Add support for the standard CROSS JOIN syntax. Reporter: Santiago M. Mola Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5242) ec2/spark_ec2.py lauch does not work with VPC if no public DNS or IP is available
[ https://issues.apache.org/jira/browse/SPARK-5242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483164#comment-14483164 ] Apache Spark commented on SPARK-5242: - User 'mdagost' has created a pull request for this issue: https://github.com/apache/spark/pull/5244 ec2/spark_ec2.py lauch does not work with VPC if no public DNS or IP is available --- Key: SPARK-5242 URL: https://issues.apache.org/jira/browse/SPARK-5242 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Labels: easyfix How to reproduce: a user starting a cluster in a VPC needs to wait forever:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript
Setting up security groups...
Searching for existing cluster SparkByScript...
Spark AMI: ami-1ae0166d
Launching instances...
Launched 1 slaves in eu-west-1a, regid = r-e70c5502
Launched master in eu-west-1a, regid = r-bf0f565a
Waiting for cluster to enter 'ssh-ready' state..{forever}
{code}
The problem is that the current code wrongly assumes that a VPC instance has a public_dns_name or public ip_address. Actually, it is more common for a VPC instance to have only a private_ip_address. The bug is already fixed in my fork; I am going to submit a pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Summary: Support List as a return type in Hive UDF (was: Support List as a return type in Hive UDF) Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Reporter: Takeshi Yamamuro The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6746) Refactor large functions in DAGScheduler to improve readibility
Ilya Ganelin created SPARK-6746: --- Summary: Refactor large functions in DAGScheduler to improve readibility Key: SPARK-6746 URL: https://issues.apache.org/jira/browse/SPARK-6746 Project: Spark Issue Type: Sub-task Reporter: Ilya Ganelin The DAGScheduler class contains two huge functions that make it very hard to understand what's going on in the code. These are: 1) The monolithic handleTaskCompletion 2) The cleanupStateForJobAndIndependentStages function These can be simply modularized to eliminate some awkward type casting and improve code readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Description: The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this has one difficulty because of type erasure in the JVM. Assume the lines below are appended in HiveInspectors#javaClassToDataType:
{code}
// list type
case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
  val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
  println(tpe.getActualTypeArguments()(0).toString()) // prints 'E'
{code}
This logic fails to catch the component type of the List. was: The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Reporter: Takeshi Yamamuro The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at
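The erasure difficulty described in this update can be reproduced without Spark or Hive. In the minimal sketch below (a plain class standing in for the UDF, no Hive dependency), inspecting java.util.List's generic interfaces only yields the unbound type variable E, exactly as the description's println reports, while the declared generic return type of the evaluate method still carries the String component type — one possible place to recover it:

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.util.Arrays;
import java.util.List;

public class TypeErasureDemo {

    // Stand-in for the Hive UDF from the report.
    public List<String> evaluate(Object o) {
        return Arrays.asList("xxx", "yyy", "zzz");
    }

    // What the snippet in the description does: starting from the Class object
    // java.util.List, its first generic interface is parameterized only by the
    // unbound type variable E, so no concrete component type is recoverable.
    public static String interfaceTypeArgument() {
        ParameterizedType pt = (ParameterizedType) List.class.getGenericInterfaces()[0];
        return pt.getActualTypeArguments()[0].toString();
    }

    // The erased return type of evaluate: just the raw interface,
    // which is what the scala.MatchError above is matching on.
    public static String erasedReturnType() throws Exception {
        Method m = TypeErasureDemo.class.getMethod("evaluate", Object.class);
        return m.getReturnType().toString();
    }

    // The declared generic return type still carries the component type,
    // because erasure drops instance type arguments, not method signatures.
    public static String genericComponentType() throws Exception {
        Method m = TypeErasureDemo.class.getMethod("evaluate", Object.class);
        ParameterizedType pt = (ParameterizedType) m.getGenericReturnType();
        return pt.getActualTypeArguments()[0].getTypeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(interfaceTypeArgument()); // E
        System.out.println(erasedReturnType());      // interface java.util.List
        System.out.println(genericComponentType());  // java.lang.String
    }
}
```

Whether the real fix can reach the Method object rather than just the Class is a separate question; this only illustrates why a Class alone is not enough.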
[jira] [Updated] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-6747: Description: The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. was: The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... Support List as a return type in Hive UDF - Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Reporter: Takeshi Yamamuro The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6745) Develop a general filter function to be used in PrunedFilteredScan and CatalystScan
Alex Liu created SPARK-6745: --- Summary: Develop a general filter function to be used in PrunedFilteredScan and CatalystScan Key: SPARK-6745 URL: https://issues.apache.org/jira/browse/SPARK-6745 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Alex Liu While integrating PrunedFilteredScan/CatalystScan with Cassandra, some filters are pushed down to the Cassandra Spark connector and others are left out. The leftover filters need to be applied to the Rdd[Row] obtained from the Cassandra Spark Connector: Rdd[Row].filter(generalFilter)
{code}
def filter(f: T => Boolean): RDD[T]

generalFilter: Row => Boolean
{code}
Here generalFilter combines the leftover filters and applies them to each Row. This is a general function which could be applied to many data source integrations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
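The generalFilter idea in this ticket can be written down concretely: fold the filters that were not pushed down into a single predicate and apply it row by row. A plain-Java sketch, where Row is a hypothetical stand-in for Spark SQL's Row and the sample values mirror the tab0 data used elsewhere in this digest:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class GeneralFilter {

    // Hypothetical three-column Row stand-in.
    public static final class Row {
        final int c0, c1, c2;
        public Row(int c0, int c1, int c2) { this.c0 = c0; this.c1 = c1; this.c2 = c2; }
    }

    // AND together all leftover (non-pushed-down) filters.
    // An empty list yields an accept-everything predicate.
    public static Predicate<Row> combine(List<Predicate<Row>> leftover) {
        return leftover.stream().reduce(r -> true, Predicate::and);
    }

    public static long keptCount() {
        List<Row> rows = Arrays.asList(
                new Row(83, 0, 38), new Row(26, 0, 79), new Row(43, 81, 24));
        // Two leftover filters the connector hypothetically could not push down.
        Predicate<Row> general = combine(Arrays.asList(
                r -> r.c1 == 0,
                r -> r.c0 > 30));
        return rows.stream().filter(general).count();
    }

    public static void main(String[] args) {
        System.out.println(keptCount()); // 1: only Row(83, 0, 38) passes both filters
    }
}
```

In Spark the same fold would produce the Row => Boolean passed to Rdd[Row].filter; the reduction with an accept-all identity keeps the code correct when every filter was pushed down.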
[jira] [Created] (SPARK-6747) Support List as a return type in Hive UDF
Takeshi Yamamuro created SPARK-6747: --- Summary: Support List as a return type in Hive UDF Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Reporter: Takeshi Yamamuro The current implementation can't handle List as a return type in Hive UDF. We assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
An exception of scala.MatchError is thrown as follows when the UDF is used: scala.MatchError: interface java.util.List (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174) at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106) at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95) at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278) ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483335#comment-14483335 ] Peter Rudenko commented on SPARK-5114: -- +1 for "should". In my use case (creating a pipeline from a config file), sometimes there is a need to do evaluation with custom metrics (e.g. Gini norm, etc.), and sometimes there is no need to do evaluation at all because it is done in another part of the system. It would be more flexible if the evaluator were a part of the pipeline. Should Evaluator be a PipelineStage --- Key: SPARK-5114 URL: https://issues.apache.org/jira/browse/SPARK-5114 Project: Spark Issue Type: Question Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Pipelines can currently contain Estimators and Transformers. Question for debate: Should Pipelines be able to contain Evaluators? Pros: * Evaluators take input datasets with a particular schema, which should perhaps be checked before running a Pipeline. Cons: * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6746) Refactor large functions in DAGScheduler to improve readibility
[ https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6746: --- Assignee: Apache Spark Refactor large functions in DAGScheduler to improve readibility --- Key: SPARK-6746 URL: https://issues.apache.org/jira/browse/SPARK-6746 Project: Spark Issue Type: Sub-task Components: Scheduler Reporter: Ilya Ganelin Assignee: Apache Spark The DAGScheduler class contains two huge functions that make it very hard to understand what's going on in the code. These are: 1) The monolithic handleTaskCompletion 2) The cleanupStateForJobAndIndependentStages function These can be simply modularized to eliminate some awkward type casting and improve code readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5818) unable to use add jar in hql
[ https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5818: --- Assignee: (was: Apache Spark) unable to use add jar in hql -- Key: SPARK-5818 URL: https://issues.apache.org/jira/browse/SPARK-5818 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: pengxu In Spark 1.2.0 and 1.2.1, it is not possible to use the Hive command 'add jar' in hql. It seems that the problem from SPARK-2219 still exists. The problem can be reproduced as described below. Suppose the jar file is named brickhouse-0.6.0.jar and is placed in the /tmp directory:
{code}
spark-shell> import org.apache.spark.sql.hive._
spark-shell> val sqlContext = new HiveContext(sc)
spark-shell> import sqlContext._
spark-shell> hql("add jar /tmp/brickhouse-0.6.0.jar")
{code}
The error message is shown below: {code:title=Error Log} 15/02/15 01:36:31 ERROR SessionState: Unable to register /tmp/brickhouse-0.6.0.jar Exception: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to java.net.URLClassLoader java.lang.ClassCastException: org.apache.spark.repl.SparkIMain$TranslatingClassLoader cannot be cast to java.net.URLClassLoader at org.apache.hadoop.hive.ql.exec.Utilities.addToClassPath(Utilities.java:1921) at org.apache.hadoop.hive.ql.session.SessionState.registerJar(SessionState.java:599) at org.apache.hadoop.hive.ql.session.SessionState$ResourceType$2.preHook(SessionState.java:658) at org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:732) at org.apache.hadoop.hive.ql.session.SessionState.add_resource(SessionState.java:717) at org.apache.hadoop.hive.ql.processors.AddResourceProcessor.run(AddResourceProcessor.java:54) at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:319) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276) at
org.apache.spark.sql.hive.execution.AddJar.sideEffectResult$lzycompute(commands.scala:74) at org.apache.spark.sql.hive.execution.AddJar.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46) at org.apache.spark.sql.hive.execution.AddJar.execute(commands.scala:68) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425) at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108) at org.apache.spark.sql.hive.HiveContext.hiveql(HiveContext.scala:102) at org.apache.spark.sql.hive.HiveContext.hql(HiveContext.scala:106) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:33) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:35) at $line30.$read$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:37) at $line30.$read$$iwC$$iwC$$iwC$$iwC.init(console:39) at $line30.$read$$iwC$$iwC$$iwC.init(console:41) at $line30.$read$$iwC$$iwC.init(console:43) at $line30.$read$$iwC.init(console:45) at $line30.$read.init(console:47) at $line30.$read$.init(console:51) at $line30.$read$.clinit(console) at $line30.$eval$.init(console:7) at $line30.$eval$.clinit(console) at $line30.$eval.$print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852) at 
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785) at
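For reference, the failing cast in the trace above is a general JVM hazard: Hive's {{Utilities.addToClassPath}} assumes the thread context class loader is a {{java.net.URLClassLoader}}, and the REPL's {{TranslatingClassLoader}} is not one, so the downcast throws. A minimal Java sketch of that failure mode (the anonymous loader below is a stand-in for illustration, not Spark's actual class):

```java
import java.net.URLClassLoader;

public class ClassLoaderCastDemo {
    public static void main(String[] args) {
        // Stand-in for a non-URLClassLoader implementation such as the
        // REPL's TranslatingClassLoader: any plain ClassLoader subclass.
        ClassLoader replStyleLoader = new ClassLoader() {};

        try {
            // The same blind downcast Hive performs in Utilities.addToClassPath.
            URLClassLoader u = (URLClassLoader) replStyleLoader;
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getClass().getSimpleName());
        }

        // A defensive caller would check the type instead of casting blindly.
        System.out.println(replStyleLoader instanceof URLClassLoader); // false
    }
}
```

The fix direction discussed in the linked pull request is to stop relying on the concrete loader type.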
[jira] [Commented] (SPARK-5818) unable to use add jar in hql
[ https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483376#comment-14483376 ] Apache Spark commented on SPARK-5818: - User 'gvramana' has created a pull request for this issue: https://github.com/apache/spark/pull/5393 unable to use add jar in hql -- Key: SPARK-5818 URL: https://issues.apache.org/jira/browse/SPARK-5818 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: pengxu
[jira] [Assigned] (SPARK-5818) unable to use add jar in hql
[ https://issues.apache.org/jira/browse/SPARK-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5818: --- Assignee: Apache Spark unable to use add jar in hql -- Key: SPARK-5818 URL: https://issues.apache.org/jira/browse/SPARK-5818 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: pengxu Assignee: Apache Spark
[jira] [Assigned] (SPARK-6746) Refactor large functions in DAGScheduler to improve readability
[ https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6746: --- Assignee: (was: Apache Spark) Refactor large functions in DAGScheduler to improve readability --- Key: SPARK-6746 URL: https://issues.apache.org/jira/browse/SPARK-6746 Project: Spark Issue Type: Sub-task Components: Scheduler Reporter: Ilya Ganelin
[jira] [Commented] (SPARK-6746) Refactor large functions in DAGScheduler to improve readability
[ https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483488#comment-14483488 ] Apache Spark commented on SPARK-6746: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5394 Refactor large functions in DAGScheduler to improve readability --- Key: SPARK-6746 URL: https://issues.apache.org/jira/browse/SPARK-6746 Project: Spark Issue Type: Sub-task Components: Scheduler Reporter: Ilya Ganelin
[jira] [Assigned] (SPARK-6747) Support List as a return type in Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6747: --- Assignee: Apache Spark Support List as a return type in Hive UDF --- Key: SPARK-6747 URL: https://issues.apache.org/jira/browse/SPARK-6747 Project: Spark Issue Type: Bug Components: SQL Reporter: Takeshi Yamamuro Assignee: Apache Spark The current implementation can't handle List as a return type in a Hive UDF. Assume a UDF like the one below:
{code}
public class UDFToListString extends UDF {
  public List<String> evaluate(Object o) {
    return Arrays.asList("xxx", "yyy", "zzz");
  }
}
{code}
A scala.MatchError is thrown as follows when the UDF is used:
{code}
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...
{code}
To fix this problem, we need to add an entry for List in HiveInspectors#javaClassToDataType. However, this is difficult because of type erasure in the JVM. Assume the lines below are appended in HiveInspectors#javaClassToDataType:
{code}
// list type
case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
  val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
  println(tpe.getActualTypeArguments()(0).toString()) // prints 'E'
{code}
This logic fails to recover the component type of the List.
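The difficulty the reporter describes is a general JVM limitation, not something specific to Hive or Spark: generic element types are erased at runtime, so a {{Class}} object for a List can never reveal its component type, and reflective probing of the class only surfaces the declared type variable {{E}}. A small stand-alone Java demonstration of both facts:

```java
import java.lang.reflect.ParameterizedType;
import java.util.ArrayList;
import java.util.List;

public class ErasureDemo {
    public static void main(String[] args) {
        List<String> strings = new ArrayList<>();
        List<Integer> ints = new ArrayList<>();

        // At runtime both lists share the exact same Class object:
        // the element type has been erased.
        System.out.println(strings.getClass() == ints.getClass()); // true

        // Reflecting on the class only exposes the declared type *variable* E,
        // never a concrete element type -- the same reason the probe in
        // javaClassToDataType prints 'E'.
        ParameterizedType p =
            (ParameterizedType) ArrayList.class.getGenericSuperclass();
        System.out.println(p.getActualTypeArguments()[0]); // prints "E"
    }
}
```

This is why the element type has to come from somewhere other than the returned {{Class}} object, e.g. the UDF method's generic return type.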
[jira] [Assigned] (SPARK-6750) Upgrade ScalaStyle to 0.7
[ https://issues.apache.org/jira/browse/SPARK-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6750: --- Assignee: Apache Spark (was: Reynold Xin) Upgrade ScalaStyle to 0.7 - Key: SPARK-6750 URL: https://issues.apache.org/jira/browse/SPARK-6750 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Apache Spark 0.7 contains a useful bug fix: inline functions no longer require an explicit return type definition.
[jira] [Assigned] (SPARK-6750) Upgrade ScalaStyle to 0.7
[ https://issues.apache.org/jira/browse/SPARK-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6750: --- Assignee: Reynold Xin (was: Apache Spark) Upgrade ScalaStyle to 0.7 - Key: SPARK-6750 URL: https://issues.apache.org/jira/browse/SPARK-6750 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Comment Edited] (SPARK-3219) K-Means clusterer should support Bregman distance functions
[ https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14482297#comment-14482297 ] Sai Nishanth Parepally edited comment on SPARK-3219 at 4/7/15 6:10 PM: --- [~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering going to be merged into MLlib? I would like to use Jaccard distance as a distance metric for k-means clustering, and I would like to know whether I should add this distance metric to derrickburns's repository or instead make the current MLlib implementation of k-means accept a method that computes the distance between any two points. was (Author: nishanthps): [~mengxr], is https://github.com/derrickburns/generalized-kmeans-clustering going to be merged into MLlib? I would like to use Jaccard distance as a distance metric for k-means clustering. K-Means clusterer should support Bregman distance functions --- Key: SPARK-3219 URL: https://issues.apache.org/jira/browse/SPARK-3219 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Derrick Burns Assignee: Derrick Burns Labels: clustering The K-Means clusterer supports the Euclidean distance metric. However, it is rather straightforward to support Bregman (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) distance functions, which would increase the utility of the clusterer tremendously. I have modified the clusterer to support pluggable distance functions. However, I notice that there are hundreds of outstanding pull requests. If someone is willing to work with me to sponsor the work through the process, I will create a pull request. Otherwise, I will just keep my own fork.
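For context on the request above: a Bregman divergence is defined from a strictly convex function φ as D_φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩, and choosing φ(x) = ||x||² recovers squared Euclidean distance, which is why the existing k-means objective is a special case. A small Java sketch (illustrative only, not MLlib code) checks that identity numerically:

```java
public class BregmanDemo {
    // Bregman divergence for phi(x) = ||x||^2:
    //   D(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>,  grad phi(y) = 2y
    static double bregmanSqNorm(double[] x, double[] y) {
        double phiX = 0, phiY = 0, inner = 0;
        for (int i = 0; i < x.length; i++) {
            phiX += x[i] * x[i];
            phiY += y[i] * y[i];
            inner += 2 * y[i] * (x[i] - y[i]);
        }
        return phiX - phiY - inner;
    }

    static double squaredEuclidean(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            s += d * d;
        }
        return s;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0}, y = {3.0, -1.0};
        // Both formulas agree: squared Euclidean is the Bregman
        // divergence generated by the squared norm.
        System.out.println(bregmanSqNorm(x, y));    // 13.0
        System.out.println(squaredEuclidean(x, y)); // 13.0
    }
}
```

A pluggable-distance clusterer would accept any such D_φ in place of {{squaredEuclidean}}; the Banerjee et al. paper linked in the issue shows that the mean remains the correct centroid for every Bregman divergence.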
[jira] [Created] (SPARK-6751) Spark History Server support multiple application attempts
Thomas Graves created SPARK-6751: Summary: Spark History Server support multiple application attempts Key: SPARK-6751 URL: https://issues.apache.org/jira/browse/SPARK-6751 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 1.3.0 Reporter: Thomas Graves Spark on YARN supports running multiple application attempts (a configurable number) in case the first (or second, etc.) attempt fails. The Spark History Server only supports one history file per application, though. Under the default configs it keeps the first attempt's history file. You can set the undocumented config spark.eventLog.overwrite to allow follow-on attempts to overwrite the first attempt's history file. Note that in Spark 1.2, not having the overwrite config set causes any following attempts to fail to run; in Spark 1.3 they run and you just see a warning at the end of the attempt. It would be really nice to have an option that keeps all the attempts' history files, so a user can go back and look at each one individually.
[jira] [Resolved] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6733. Resolution: Fixed Fix Version/s: 1.4.0 Suppression of usage of Scala existential code should be done - Key: SPARK-6733 URL: https://issues.apache.org/jira/browse/SPARK-6733 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 1.3.0 Environment: OS: OSX Yosemite Hardware: Intel Core i7 with 16 GB RAM Reporter: Raymond Tay Priority: Trivial Fix For: 1.4.0 The inclusion of this statement in the file {code:scala} import scala.language.existentials {code} should have suppressed all warnings regarding the use of Scala existential code.
[jira] [Commented] (SPARK-6750) Upgrade ScalaStyle to 0.7
[ https://issues.apache.org/jira/browse/SPARK-6750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483642#comment-14483642 ] Apache Spark commented on SPARK-6750: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5399 Upgrade ScalaStyle to 0.7 - Key: SPARK-6750 URL: https://issues.apache.org/jira/browse/SPARK-6750 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Reynold Xin Assignee: Reynold Xin
[jira] [Commented] (SPARK-6746) Refactor large functions in DAGScheduler to improve readability
[ https://issues.apache.org/jira/browse/SPARK-6746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483547#comment-14483547 ] Apache Spark commented on SPARK-6746: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5396 Refactor large functions in DAGScheduler to improve readability --- Key: SPARK-6746 URL: https://issues.apache.org/jira/browse/SPARK-6746 Project: Spark Issue Type: Sub-task Components: Scheduler Reporter: Ilya Ganelin
[jira] [Created] (SPARK-6748) QueryPlan.schema should be a lazy val to avoid creating excessive duplicate StructType objects
Cheng Lian created SPARK-6748: - Summary: QueryPlan.schema should be a lazy val to avoid creating excessive duplicate StructType objects Key: SPARK-6748 URL: https://issues.apache.org/jira/browse/SPARK-6748 Project: Spark Issue Type: Bug Affects Versions: 1.3.0 Reporter: Cheng Lian Spotted this issue while trying to do a simple micro benchmark:
{code}
sc.parallelize(1 to 1000).
  map(i => (i, s"val_$i")).
  toDF("key", "value").
  saveAsParquetFile("file:///tmp/src.parquet")

sqlContext.parquetFile("file:///tmp/src.parquet").collect()
{code}
YJP profiling showed that *10 million {{StructType}}, 10 million {{StructField[]}}, and 20 million {{StructField}} instances were allocated*. It turned out that {{DataFrame.collect()}} calls {{SparkPlan.executeCollect()}}, which consists of a single line:
{code}
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
{code}
The problem is that {{QueryPlan.schema}} is a function, and since 1.3.0, {{convertRowToScala}} returns a {{GenericRowWithSchema}}. Together these two facts produce 10 million rows, each with a separate schema object.
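The proposed fix in the summary is to make {{schema}} a {{lazy val}} so the StructType is built once per plan and shared by every row. The difference between a recomputed {{def}} and a cached {{lazy val}} can be illustrated outside Spark with a tiny Java analogue (class and field names here are invented for illustration; a counter stands in for the allocation cost):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyDemo {
    static AtomicInteger builds = new AtomicInteger();

    // Analogue of `def schema`: every call rebuilds the object.
    static Supplier<int[]> defSchema = () -> {
        builds.incrementAndGet();
        return new int[]{1, 2, 3};
    };

    // Analogue of a Scala `lazy val`: compute once on first access, then reuse.
    static class Lazy<T> {
        private T value;
        private final Supplier<T> init;
        Lazy(Supplier<T> init) { this.init = init; }
        synchronized T get() {
            if (value == null) value = init.get();
            return value;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) defSchema.get();
        System.out.println(builds.get()); // 1000: one allocation per "row"

        builds.set(0);
        Lazy<int[]> lazySchema = new Lazy<>(defSchema);
        for (int i = 0; i < 1000; i++) lazySchema.get();
        System.out.println(builds.get()); // 1: shared by all rows
    }
}
```

With the {{def}} variant, N collected rows mean N schema allocations, which matches the profiling numbers in the report; the cached variant allocates once.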
[jira] [Commented] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory
[ https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483609#comment-14483609 ] Apache Spark commented on SPARK-6737: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/5397 OutputCommitCoordinator.authorizedCommittersByStage map out of memory - Key: SPARK-6737 URL: https://issues.apache.org/jira/browse/SPARK-6737 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core, Streaming Affects Versions: 1.3.0, 1.3.1 Environment: spark 1.3.1 Reporter: Tao Li Assignee: Josh Rosen Priority: Critical Labels: Bug, Core, DAGScheduler, OOM, Streaming I am using Spark Streaming (1.3.1) as a long-running service and it ran out of memory after running for 7 days. I found that the field authorizedCommittersByStage in the OutputCommitCoordinator class causes the OOM. authorizedCommittersByStage is a map whose key is a StageId and whose value is a Map[PartitionId, TaskAttemptId]. The OutputCommitCoordinator class has a method stageEnd which should remove a stageId from authorizedCommittersByStage, but stageEnd is never called by the DAGScheduler. As a result, the per-stage entries in authorizedCommittersByStage are never cleaned up, which causes the OOM. It happens in my Spark Streaming program (1.3.1); I am not sure whether it appears in other Spark components or other Spark versions.
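The leak pattern described above, a per-stage map that is only ever inserted into, can be sketched in a few lines of Java (a simplified stand-in, not the actual OutputCommitCoordinator code; method names mirror the ones mentioned in the report):

```java
import java.util.HashMap;
import java.util.Map;

public class StageMapDemo {
    // Simplified analogue of authorizedCommittersByStage:
    // stageId -> (partitionId -> taskAttemptId)
    static Map<Integer, Map<Integer, Long>> authorized = new HashMap<>();

    static void stageStart(int stageId) {
        authorized.put(stageId, new HashMap<>());
    }

    // The cleanup the scheduler must invoke: drop the stage's entry when it ends.
    static void stageEnd(int stageId) {
        authorized.remove(stageId);
    }

    public static void main(String[] args) {
        // Without stageEnd calls, the map keeps one entry per stage forever --
        // unbounded growth in a long-running streaming job, hence the OOM.
        for (int stage = 0; stage < 10_000; stage++) stageStart(stage);
        System.out.println(authorized.size()); // 10000

        // With cleanup, the map stays bounded by the number of active stages.
        for (int stage = 0; stage < 10_000; stage++) stageEnd(stage);
        System.out.println(authorized.size()); // 0
    }
}
```

The fix in the linked pull request wires the stage-completion path to perform exactly this kind of removal.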
[jira] [Assigned] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory
[ https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6737: --- Assignee: Apache Spark (was: Josh Rosen) OutputCommitCoordinator.authorizedCommittersByStage map out of memory - Key: SPARK-6737 URL: https://issues.apache.org/jira/browse/SPARK-6737 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core, Streaming Affects Versions: 1.3.0, 1.3.1 Environment: spark 1.3.1 Reporter: Tao Li Assignee: Apache Spark Priority: Critical Labels: Bug, Core, DAGScheduler, OOM, Streaming
[jira] [Assigned] (SPARK-6737) OutputCommitCoordinator.authorizedCommittersByStage map out of memory
[ https://issues.apache.org/jira/browse/SPARK-6737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6737: --- Assignee: Josh Rosen (was: Apache Spark) OutputCommitCoordinator.authorizedCommittersByStage map out of memory - Key: SPARK-6737 URL: https://issues.apache.org/jira/browse/SPARK-6737 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core, Streaming Affects Versions: 1.3.0, 1.3.1 Environment: spark 1.3.1 Reporter: Tao Li Assignee: Josh Rosen Priority: Critical Labels: Bug, Core, DAGScheduler, OOM, Streaming