[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315024#comment-14315024 ] Marcelo Vanzin commented on SPARK-4705: --- Hi [~twinkle], I think the UI on the latest screenshot is a little too cluttered. How about: - Keep the app id as the main link to the application's UI, pointing at the last attempt in the case of multiple attempts - Have the attempt column list the attempt IDs only for those apps that have multiple attempts. Those with a single attempt would have an empty cell. This would result in a redundant link (app id link + link to the last attempt pointing at the same place), but I think it looks better. And it's probably less confusing for those used to the current UI. Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png yarn-cluster mode will retry the driver in certain failure modes. If event logging is enabled, the retry will most probably fail, because:
{noformat}
Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
{noformat}
The event log path should be more unique, or retries of the same app should clean up the old logs first.
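A minimal sketch of the first suggestion, in Scala: suffix the event log directory with the attempt number so a retried driver never collides with the previous attempt's directory. The helper name and the attemptId parameter are hypothetical, not Spark's actual API.
{code}
// Hypothetical helper: one log directory per (app, attempt) pair.
def eventLogDir(base: String, appId: String, attemptId: Int): String =
  s"$base/${appId}_$attemptId"

// e.g. eventLogDir("hdfs:///user/spark/applicationHistory",
//                  "application_1417554558066_0003", 2)
// ==> .../applicationHistory/application_1417554558066_0003_2
{code}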
[jira] [Updated] (SPARK-5592) java.net.URISyntaxException when inserting data into a partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5592: -- Assignee: wangfei java.net.URISyntaxException when inserting data into a partitioned table - Key: SPARK-5592 URL: https://issues.apache.org/jira/browse/SPARK-5592 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Assignee: wangfei Fix For: 1.3.0 create table sc as select * from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows) union all select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows) union all select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s; create table sc_part (key string) partitioned by (ts string) stored as rcfile; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table sc_part partition(ts) select * from sc; java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26 at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172) at org.apache.hadoop.fs.Path.<init>(Path.java:94) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26 at java.net.URI.checkPath(URI.java:1804) at java.net.URI.<init>(URI.java:752) at org.apache.hadoop.fs.Path.initialize(Path.java:203) ... 21 more
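The offending character is the ':' (and the '+') in the dynamic-partition value: Hadoop's Path rejects a relative URI whose first segment contains ':'. A hedged Scala sketch of the escaping idea, where escapePartitionValue is a hypothetical helper for illustration, not the actual change merged in PR 4368:
{code}
import org.apache.hadoop.fs.Path

// Hypothetical helper: percent-escape characters that break a relative URI path.
def escapePartitionValue(v: String): String =
  v.flatMap {
    case c if ":+%".contains(c) => f"%%${c.toInt}%02X"
    case c                      => c.toString
  }

// "ts=2011-01-11%2B15%3A18%3A26" parses fine as a relative path
val partDir = new Path("ts=" + escapePartitionValue("2011-01-11+15:18:26"))
{code}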
[jira] [Resolved] (SPARK-5592) java.net.URISyntaxException when inserting data into a partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5592. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4368 [https://github.com/apache/spark/pull/4368] java.net.URISyntaxException when inserting data into a partitioned table - Key: SPARK-5592 URL: https://issues.apache.org/jira/browse/SPARK-5592 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Fix For: 1.3.0 create table sc as select * from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows) union all select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows) union all select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s; create table sc_part (key string) partitioned by (ts string) stored as rcfile; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table sc_part partition(ts) select * from sc; java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26 at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172) at org.apache.hadoop.fs.Path.<init>(Path.java:94) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26 at java.net.URI.checkPath(URI.java:1804) at java.net.URI.<init>(URI.java:752) at org.apache.hadoop.fs.Path.initialize(Path.java:203) ... 21 more
[jira] [Resolved] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed
[ https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5668. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4457 [https://github.com/apache/spark/pull/4457] spark_ec2.py region parameter could be either mandatory or its value displayed -- Key: SPARK-5668 URL: https://issues.apache.org/jira/browse/SPARK-5668 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Miguel Peralvo Priority: Minor Labels: starter Fix For: 1.4.0 If the region parameter is not specified when invoking spark-ec2 (spark-ec2.py behind the scenes), it defaults to us-east-1. When the cluster doesn't belong to that region, after showing the "Searching for existing cluster Spark..." message, it raises an "ERROR: Could not find any existing cluster" exception because it doesn't find your cluster in the default region. As it doesn't tell you anything about the region, it can be a small headache for new users. Dmitriy Selivanov explains it in http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster. I propose that: 1. Either we make the search message a little more informative, with something like "Searching for existing cluster Spark in region " + opts.region. 2. Or we remove us-east-1 as the default and make the --region parameter mandatory.
[jira] [Closed] (SPARK-5668) spark_ec2.py region parameter could be either mandatory or its value displayed
[ https://issues.apache.org/jira/browse/SPARK-5668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miguel Peralvo closed SPARK-5668. - spark_ec2.py region parameter could be either mandatory or its value displayed -- Key: SPARK-5668 URL: https://issues.apache.org/jira/browse/SPARK-5668 Project: Spark Issue Type: Improvement Components: EC2 Affects Versions: 1.2.0, 1.3.0, 1.4.0 Reporter: Miguel Peralvo Priority: Minor Labels: starter Fix For: 1.4.0 If the region parameter is not specified when invoking spark-ec2 (spark-ec2.py behind the scenes), it defaults to us-east-1. When the cluster doesn't belong to that region, after showing the "Searching for existing cluster Spark..." message, it raises an "ERROR: Could not find any existing cluster" exception because it doesn't find your cluster in the default region. As it doesn't tell you anything about the region, it can be a small headache for new users. Dmitriy Selivanov explains it in http://stackoverflow.com/questions/21171576/why-does-spark-ec2-fail-with-error-could-not-find-any-existing-cluster. I propose that: 1. Either we make the search message a little more informative, with something like "Searching for existing cluster Spark in region " + opts.region. 2. Or we remove us-east-1 as the default and make the --region parameter mandatory.
[jira] [Commented] (SPARK-5613) YarnClientSchedulerBackend fails to get application report when yarn restarts
[ https://issues.apache.org/jira/browse/SPARK-5613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314821#comment-14314821 ] Patrick Wendell commented on SPARK-5613: I have cherry-picked it into the 1.3 branch. YarnClientSchedulerBackend fails to get application report when yarn restarts - Key: SPARK-5613 URL: https://issues.apache.org/jira/browse/SPARK-5613 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Kashish Jain Assignee: Kashish Jain Priority: Minor Fix For: 1.3.0, 1.2.2 Original Estimate: 24h Remaining Estimate: 24h Steps to reproduce: 1) Run any Spark job 2) Stop YARN while the Spark job is running (an application id has been generated by now) 3) Restart YARN now 4) The AsyncMonitorApplication thread fails with an ApplicationNotFoundException, which terminates the thread. Here is the stack trace: 15/02/05 05:22:37 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:38 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:39 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) 15/02/05 05:22:40 INFO Client: Retrying connect to server: nn1/192.168.173.176:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS) Exception in thread "Yarn application state monitor" org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1423113179043_0003' doesn't exist in RM.
at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:284) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:145) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:321) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source) at java.lang.reflect.Constructor.newInstance(Unknown Source) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:166) at sun.reflect.GeneratedMethodAccessor18.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103) at com.sun.proxy.$Proxy12.getApplicationReport(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:291) at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:116) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:120) Caused by:
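A hedged Scala sketch of the kind of guard that keeps the monitor thread alive across an RM restart; the method and its placement are approximations for illustration, not the actual patch in YarnClientSchedulerBackend.scala:
{code}
import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException

// Returns true when monitoring should stop. An RM that no longer knows the app
// after a restart is treated as "application gone" instead of killing the thread.
def appFinished(client: YarnClient, appId: ApplicationId): Boolean =
  try {
    val state = client.getApplicationReport(appId).getYarnApplicationState
    state == YarnApplicationState.FINISHED ||
    state == YarnApplicationState.FAILED ||
    state == YarnApplicationState.KILLED
  } catch {
    case _: ApplicationNotFoundException => true
  }
{code}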
[jira] [Updated] (SPARK-5592) java.net.URISyntaxException when inserting data into a partitioned table
[ https://issues.apache.org/jira/browse/SPARK-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5592: -- Description:
{code}
create table sc as select * from
(select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows)
 union all
 select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows)
 union all
 select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows)
) s;
create table sc_part (key string) partitioned by (ts string) stored as rcfile;
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table sc_part partition(ts) select * from sc;
{code}
Exception thrown:
{code}
java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243)
at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.getLocalFileWriter(hiveWriterContainers.scala:243)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:113)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:105)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:105)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:87)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26
at java.net.URI.checkPath(URI.java:1804)
at java.net.URI.<init>(URI.java:752)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 21 more
{code}
was: create table sc as select * from (select '2011-01-11', '2011-01-11+14:18:26' from src tablesample (1 rows) union all select '2011-01-11', '2011-01-11+15:18:26' from src tablesample (1 rows) union all select '2011-01-11', '2011-01-11+16:18:26' from src tablesample (1 rows) ) s; create table sc_part (key string) partitioned by (ts string) stored as rcfile; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; insert overwrite table sc_part partition(ts) select * from sc; java.net.URISyntaxException: Relative path in absolute URI: ts=2011-01-11+15:18:26 at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172) at org.apache.hadoop.fs.Path.<init>(Path.java:94) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.org$apache$spark$sql$hive$SparkHiveDynamicPartitionWriterContainer$$newWriter$1(hiveWriterContainers.scala:230) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243) at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer$$anonfun$getLocalFileWriter$1.apply(hiveWriterContainers.scala:243) at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189) at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91) at
[jira] [Commented] (SPARK-5645) Track local bytes read for shuffles - update UI
[ https://issues.apache.org/jira/browse/SPARK-5645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314811#comment-14314811 ] Apache Spark commented on SPARK-5645: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4510 Track local bytes read for shuffles - update UI --- Key: SPARK-5645 URL: https://issues.apache.org/jira/browse/SPARK-5645 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Kostas Sakellis Assignee: Kostas Sakellis Currently we do not track the local bytes read for a shuffle read. The UI only shows the remote bytes read. This is pretty confusing to the user because: 1) In local mode all shuffle reads are local 2) the shuffle bytes written from the previous stage might not add up if there are some bytes that are read locally on the shuffle read side 3) With https://github.com/apache/spark/pull/4067 we display the total number of records, so that won't line up with only showing the remote bytes read. I propose we track the remote and local bytes read separately. In the UI show the total bytes read and in brackets show the remote bytes read for a shuffle.
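A hedged sketch of the proposed rendering, with bytesToString standing in for Spark's actual byte-formatting utility:
{code}
// Total bytes read shown first, remote share in brackets, as proposed above.
def bytesToString(b: Long): String =
  if (b >= (1L << 20)) f"${b / (1 << 20).toDouble}%.1f MB" else s"$b B"

def shuffleReadCell(localBytesRead: Long, remoteBytesRead: Long): String =
  s"${bytesToString(localBytesRead + remoteBytesRead)} (${bytesToString(remoteBytesRead)} remote)"

// shuffleReadCell(3L << 20, 1L << 20) ==> "4.0 MB (1.0 MB remote)"
{code}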
[jira] [Created] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
Don Drake created SPARK-5722: - Summary: Infer_schema_type incorrect for Integers in pyspark Key: SPARK-5722 URL: https://issues.apache.org/jira/browse/SPARK-5722 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Don Drake The Integers datatype in Python does not match what a Scala/Java Integer is defined as. This causes inference of data types and schemas to fail when a value is larger than what a 32-bit Java Integer can hold yet is inferred incorrectly as an Integer. Since the range of valid Python integers is wider than Java Integers, this causes problems when inferring Integer vs. Long datatypes, and it will cause problems when attempting to save a SchemaRDD as Parquet or JSON. Here's an example:
{code}
sqlCtx = SQLContext(sc)
from pyspark.sql import Row
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in Python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred. More tests:
{code}
from pyspark.sql import _infer_type
print _infer_type(1)        # OK: IntegerType
print _infer_type(2**31-1)  # OK: IntegerType
print _infer_type(2**31)    # WRONG: IntegerType
print _infer_type(2**61)    # WRONG: IntegerType
print _infer_type(2**71)    # OK: LongType
{code}
Java primitive types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html Python built-in types: https://docs.python.org/2/library/stdtypes.html#typesnumeric
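The boundary check the report asks for, sketched in Scala against the Spark SQL type names (pyspark's _infer_type is Python, so this only illustrates the logic; the package layout shown is Spark 1.3's):
{code}
import org.apache.spark.sql.types._ // Spark 1.3 package layout

// Widen to LongType whenever the sampled value cannot fit in a 32-bit int.
def inferIntegralType(v: Long): DataType =
  if (v >= Int.MinValue && v <= Int.MaxValue) IntegerType else LongType

assert(inferIntegralType((1L << 31) - 1) == IntegerType)
assert(inferIntegralType(1L << 31) == LongType)
{code}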
[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-5732: Description: Naturally, we may need to add a option to Add a option to print the spark version in spark script --- Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor Naturally, we may need to add a option to
[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-5732: Description: Naturally, we may need to add a option to print the spark version in spark script. It (was: Naturally, we may need to add a option to ) Add a option to print the spark version in spark script --- Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor Naturally, we may need to add a option to print the spark version in spark script. It
[jira] [Updated] (SPARK-5732) Add an option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-5732: Summary: Add an option to print the spark version in spark script (was: Add a option to print the spark version in spark script) Add an option to print the spark version in spark script Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor Naturally, we may need to add an option to print the spark version in spark script. It is pretty common in many script tools
[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-5732: Description: Naturally, we may need to add an option to print the spark version in spark script. It is pretty common in many script tools (was: Naturally, we may need to add a option to print the spark version in spark script. It is pretty common in many script tools) Add a option to print the spark version in spark script --- Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor Naturally, we may need to add an option to print the spark version in spark script. It is pretty common in many script tools
[jira] [Updated] (SPARK-5732) Add a option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] uncleGen updated SPARK-5732: Description: Naturally, we may need to add a option to print the spark version in spark script. It is pretty common in many script tools (was: Naturally, we may need to add a option to print the spark version in spark script. It) Add a option to print the spark version in spark script --- Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor Naturally, we may need to add a option to print the spark version in spark script. It is pretty common in many script tools
[jira] [Created] (SPARK-5733) Error Link in Pagination of HistroyPage when showing Incomplete Applications
Mars Gu created SPARK-5733: -- Summary: Error Link in Pagination of HistoryPage when showing Incomplete Applications Key: SPARK-5733 URL: https://issues.apache.org/jira/browse/SPARK-5733 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Mars Gu The links in the pagination of HistoryPage are wrong when showing Incomplete Applications. If "2" is clicked on the page http://history-server:18080/?page=1&showIncomplete=true, it goes to http://history-server:18080/?page=2 instead of http://history-server:18080/?page=2&showIncomplete=true.
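A minimal sketch of the fix in Scala (the real code lives in HistoryPage.scala; the helper below is hypothetical):
{code}
// The pagination link must carry showIncomplete through; dropping it is
// exactly the bug described above.
def pageLink(page: Int, showIncomplete: Boolean): String =
  s"/?page=$page&showIncomplete=$showIncomplete"

// pageLink(2, showIncomplete = true) ==> "/?page=2&showIncomplete=true"
{code}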
[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315605#comment-14315605 ] Jason Dai commented on SPARK-5654: -- I agree with this proposal. Given all the ongoing efforts around data analytics in Spark (e.g., DataFrame, ml, etc.), an R frontend for Spark seems to be very well aligned with the project's future plans. Integrate SparkR into Apache Spark -- Key: SPARK-5654 URL: https://issues.apache.org/jira/browse/SPARK-5654 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Shivaram Venkataraman The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. it does not introduce any external dependencies, etc. SparkR's goals are similar to PySpark's, and it shares a similar design, as described in our meetup talk [2] and Spark Summit presentation [3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and, given R's large user base, it will help the Spark project reach more users. Additionally, work-in-progress features like R integration with ML Pipelines and DataFrames can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR's developers come from many organizations including UC Berkeley, Alteryx and Intel, and we will support future development and maintenance after the integration. [1] https://github.com/amplab-extras/SparkR-pkg [2] http://files.meetup.com/3138542/SparkR-meetup.pdf [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2
[jira] [Resolved] (SPARK-5702) Allow short names for built-in data sources
[ https://issues.apache.org/jira/browse/SPARK-5702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5702. Resolution: Fixed Fix Version/s: 1.3.0 Allow short names for built-in data sources --- Key: SPARK-5702 URL: https://issues.apache.org/jira/browse/SPARK-5702 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 e.g. json, parquet, jdbc.
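Assuming a SQLContext named sqlContext in scope (as in spark-shell 1.3), the intended call sites would look roughly like this — a usage sketch, not code taken from the patch:
{code}
// Short names resolve to the built-in sources, e.g. "json" instead of
// the fully-qualified "org.apache.spark.sql.json".
val df = sqlContext.load("examples/src/main/resources/people.json", "json")
df.save("people.parquet", "parquet")
{code}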
[jira] [Updated] (SPARK-5677) Python DataFrame API remaining tasks
[ https://issues.apache.org/jira/browse/SPARK-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5677: --- Description: - DataFrame.renameColumn - DataFrame.show (also we should override __repr__ or __str__) - dtypes should use simpleString rather than jsonValue was: - DataFrame.renameColumn - DataFrame.show (also we should override __repr__ or __str__) Python DataFrame API remaining tasks Key: SPARK-5677 URL: https://issues.apache.org/jira/browse/SPARK-5677 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu - DataFrame.renameColumn - DataFrame.show (also we should override __repr__ or __str__) - dtypes should use simpleString rather than jsonValue
[jira] [Updated] (SPARK-4382) Add locations parameter to Twitter Stream
[ https://issues.apache.org/jira/browse/SPARK-4382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4382: --- Component/s: Streaming Add locations parameter to Twitter Stream - Key: SPARK-4382 URL: https://issues.apache.org/jira/browse/SPARK-4382 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Liang-Chi Hsieh When requesting a Tweet stream, geo-location is one of the most important parameters. In addition to the track parameter, the locations parameter is widely used to ask for the Tweets falling within the requested bounding boxes. This PR adds the locations parameter to the existing APIs.
[jira] [Updated] (SPARK-3688) LogicalPlan can't resolve column correctlly
[ https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-3688: --- Description: How to reproduce this problem:
{code}
CREATE TABLE t1(x INT);
CREATE TABLE t2(a STRUCT<x: INT>, k INT);
SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
{code}
was: How to reproduce this problem: create a table: {code} create table test (a string, b string); {code} execute sql: {code} select a.b ,count(1) from test a join test t group by a.b; {code} LogicalPlan can't resolve column correctly --- Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem:
{code}
CREATE TABLE t1(x INT);
CREATE TABLE t2(a STRUCT<x: INT>, k INT);
SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
{code}
[jira] [Created] (SPARK-5734) Allow creating a DataFrame from local Python data
Reynold Xin created SPARK-5734: -- Summary: Allow creating a DataFrame from local Python data Key: SPARK-5734 URL: https://issues.apache.org/jira/browse/SPARK-5734 Project: Spark Issue Type: Sub-task Reporter: Reynold Xin Assignee: Davies Liu Maybe a local Python list and a Pandas DataFrame.
[jira] [Updated] (SPARK-5677) Python DataFrame API remaining tasks
[ https://issues.apache.org/jira/browse/SPARK-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5677: --- Description: - DataFrame.renameColumn - DataFrame.show (also we should override __repr__ or __str__) was: - DataFrame.renameColumn - DataFrame.show - Create data frame from local data collection (does this work?) - Move all data types into a types package - load/saveAsTable/createExternalTable, etc ( see https://github.com/apache/spark/pull/4446 ) Python DataFrame API remaining tasks Key: SPARK-5677 URL: https://issues.apache.org/jira/browse/SPARK-5677 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Davies Liu - DataFrame.renameColumn - DataFrame.show (also we should override __repr__ or __str__)
[jira] [Resolved] (SPARK-5714) Refactor initial step of LDA to remove redundant operations
[ https://issues.apache.org/jira/browse/SPARK-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5714. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4501 [https://github.com/apache/spark/pull/4501] Refactor initial step of LDA to remove redundant operations --- Key: SPARK-5714 URL: https://issues.apache.org/jira/browse/SPARK-5714 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 The initialState of LDA performs several RDD operations that look redundant. This PR tries to simplify these operations.
[jira] [Updated] (SPARK-5714) Refactor initial step of LDA to remove redundant operations
[ https://issues.apache.org/jira/browse/SPARK-5714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5714: - Assignee: Liang-Chi Hsieh Refactor initial step of LDA to remove redundant operations --- Key: SPARK-5714 URL: https://issues.apache.org/jira/browse/SPARK-5714 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 The initialState of LDA performs several RDD operations that look redundant. This PR tries to simplify these operations.
[jira] [Updated] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mars Gu updated SPARK-5522: --- Description: When starting the history server, all the log files are fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster there are 2600 log files (160G) in HDFS, and it takes 3 hours to restart the history server, which is a little too long for us. It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing a log file. was: When starting the history server, all the log files are fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster there are 2600 log files (160G) in HDFS, and it takes 3 hours to restart the history server, which is a little too long for us. It would be better if the history server did not fetch all the log files but only the metadata during start-up. Accelerate the History Server start Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mars Gu When starting the history server, all the log files are fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster there are 2600 log files (160G) in HDFS, and it takes 3 hours to restart the history server, which is a little too long for us. It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing a log file.
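A hedged sketch of the proposed start-up path: register placeholder entries immediately and fill in the parsed metadata from a background pool. All names here are hypothetical, not taken from the pull request:
{code}
import java.util.concurrent.Executors

case class AppEntry(logPath: String, var name: Option[String] = None)

def parseAppName(path: String): String = s"app @ $path" // stub for the real log parse

val logPaths = Seq("hdfs:///spark/logs/app1", "hdfs:///spark/logs/app2")
val entries = logPaths.map(AppEntry(_)) // listed in the UI right away
val pool = Executors.newFixedThreadPool(4)
entries.foreach { e =>
  pool.execute(new Runnable {
    def run(): Unit = e.name = Some(parseAppName(e.logPath)) // filled in later
  })
}
{code}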
[jira] [Commented] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315686#comment-14315686 ] Apache Spark commented on SPARK-5522: - User 'marsishandsome' has created a pull request for this issue: https://github.com/apache/spark/pull/4525 Accelerate the History Server start Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mars Gu When starting the history server, all the log files are fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster there are 2600 log files (160G) in HDFS, and it takes 3 hours to restart the history server, which is a little too long for us. It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing a log file.
[jira] [Commented] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315683#comment-14315683 ] Mars Gu commented on SPARK-5522: https://github.com/apache/spark/pull/4525 Accelerate the History Server start Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mars Gu When starting the history server, all the log files are fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster there are 2600 log files (160G) in HDFS, and it takes 3 hours to restart the history server, which is a little too long for us. It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing a log file.
[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315569#comment-14315569 ] Apache Spark commented on SPARK-5722: - User 'dondrake' has created a pull request for this issue: https://github.com/apache/spark/pull/4521 Infer_schema_type incorrect for Integers in pyspark --- Key: SPARK-5722 URL: https://issues.apache.org/jira/browse/SPARK-5722 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Don Drake The Integers datatype in Python does not match what a Scala/Java Integer is defined as. This causes inference of data types and schemas to fail when a value is larger than what a 32-bit Java Integer can hold yet is inferred incorrectly as an Integer. Since the range of valid Python integers is wider than Java Integers, this causes problems when inferring Integer vs. Long datatypes, and it will cause problems when attempting to save a SchemaRDD as Parquet or JSON. Here's an example:
{code}
sqlCtx = SQLContext(sc)
from pyspark.sql import Row
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in Python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred. More tests:
{code}
from pyspark.sql import _infer_type
print _infer_type(1)        # OK: IntegerType
print _infer_type(2**31-1)  # OK: IntegerType
print _infer_type(2**31)    # WRONG: IntegerType
print _infer_type(2**61)    # WRONG: IntegerType
print _infer_type(2**71)    # OK: LongType
{code}
Java primitive types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html Python built-in types: https://docs.python.org/2/library/stdtypes.html#typesnumeric
[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
[ https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315695#comment-14315695 ] Manoj Kumar commented on SPARK-5016: [~tgaloppo] How about a method setParallelGaussianUpdate(bool) (defaulting to False) which would allow the user to decide whether to use this feature or not? [~mengxr] I would like to know your thoughts on this as well. GaussianMixtureEM should distribute matrix inverse for large numFeatures, k --- Key: SPARK-5016 URL: https://issues.apache.org/jira/browse/SPARK-5016 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley If numFeatures or k are large, GMM EM should distribute the matrix inverse computation for Gaussian initialization.
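A hedged sketch of the setter being proposed; whether GaussianMixtureEM actually grows such a method is exactly what the comment is asking about, so the class and flag below are hypothetical:
{code}
// Hypothetical opt-in flag, defaulting to the current single-machine behavior.
class GaussianMixtureEMSketch {
  private var parallelGaussianUpdate: Boolean = false

  def setParallelGaussianUpdate(parallel: Boolean): this.type = {
    parallelGaussianUpdate = parallel
    this
  }

  def getParallelGaussianUpdate: Boolean = parallelGaussianUpdate
}
{code}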
[jira] [Created] (SPARK-5735) Replace uses of EasyMock with Mockito
Patrick Wendell created SPARK-5735: -- Summary: Replace uses of EasyMock with Mockito Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure, but we suspect such conflicts might be causing non-deterministic test failures.
[jira] [Updated] (SPARK-5735) Replace uses of EasyMock with Mockito
[ https://issues.apache.org/jira/browse/SPARK-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5735: --- Assignee: Josh Rosen Replace uses of EasyMock with Mockito - Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Josh Rosen There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure, but we suspect such conflicts might be causing non-deterministic test failures.
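For contrast, a minimal Mockito usage in Scala — the when/verify style that would replace EasyMock's record/replay cycle (a JDK interface is mocked here to keep the sketch library-agnostic):
{code}
import org.mockito.Mockito._

val list = mock(classOf[java.util.List[String]])
when(list.get(0)).thenReturn("pong") // stubbing, no replay() step needed
assert(list.get(0) == "pong")
verify(list).get(0)                  // interaction check after the fact
{code}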
[jira] [Created] (SPARK-5732) Add a option to print the spark version in spark script
uncleGen created SPARK-5732: --- Summary: Add a option to print the spark version in spark script Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor
[jira] [Commented] (SPARK-5732) Add an option to print the spark version in spark script
[ https://issues.apache.org/jira/browse/SPARK-5732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315585#comment-14315585 ] Apache Spark commented on SPARK-5732: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/4522 Add an option to print the spark version in spark script Key: SPARK-5732 URL: https://issues.apache.org/jira/browse/SPARK-5732 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor Naturally, we may need to add an option to print the spark version in spark script. It is pretty common in many script tools
[jira] [Commented] (SPARK-5733) Error Link in Pagination of HistoryPage when showing Incomplete Applications
[ https://issues.apache.org/jira/browse/SPARK-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315614#comment-14315614 ] Apache Spark commented on SPARK-5733: - User 'marsishandsome' has created a pull request for this issue: https://github.com/apache/spark/pull/4523 Error Link in Pagination of HistoryPage when showing Incomplete Applications - Key: SPARK-5733 URL: https://issues.apache.org/jira/browse/SPARK-5733 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Mars Gu The links in the pagination of HistoryPage are wrong when showing Incomplete Applications. If "2" is clicked on the page http://history-server:18080/?page=1&showIncomplete=true, it goes to http://history-server:18080/?page=2 instead of http://history-server:18080/?page=2&showIncomplete=true.
[jira] [Commented] (SPARK-5733) Error Link in Pagination of HistoryPage when showing Incomplete Applications
[ https://issues.apache.org/jira/browse/SPARK-5733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315611#comment-14315611 ] Mars Gu commented on SPARK-5733: https://github.com/apache/spark/pull/4523 Error Link in Pagination of HistoryPage when showing Incomplete Applications - Key: SPARK-5733 URL: https://issues.apache.org/jira/browse/SPARK-5733 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.0 Reporter: Mars Gu The links in the pagination of HistoryPage are wrong when showing Incomplete Applications. If "2" is clicked on the page http://history-server:18080/?page=1&showIncomplete=true, it goes to http://history-server:18080/?page=2 instead of http://history-server:18080/?page=2&showIncomplete=true.
[jira] [Closed] (SPARK-4945) Add overwrite option support for SchemaRDD.saveAsParquetFile
[ https://issues.apache.org/jira/browse/SPARK-4945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-4945. -- Resolution: Implemented Add overwrite option support for SchemaRDD.saveAsParquetFile Key: SPARK-4945 URL: https://issues.apache.org/jira/browse/SPARK-4945 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor
[jira] [Commented] (SPARK-3688) LogicalPlan can't resolve column correctlly
[ https://issues.apache.org/jira/browse/SPARK-3688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315625#comment-14315625 ] Apache Spark commented on SPARK-3688: - User 'tianyi' has created a pull request for this issue: https://github.com/apache/spark/pull/4524 LogicalPlan can't resolve column correctly --- Key: SPARK-3688 URL: https://issues.apache.org/jira/browse/SPARK-3688 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Yi Tian How to reproduce this problem:
{code}
CREATE TABLE t1(x INT);
CREATE TABLE t2(a STRUCT<x: INT>, k INT);
SELECT a.x FROM t1 a JOIN t2 b ON a.x = b.k;
{code}
[jira] [Updated] (SPARK-5183) Document data source API
[ https://issues.apache.org/jira/browse/SPARK-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5183: --- Description: We need to document the data types the caller needs to support. Document data source API Key: SPARK-5183 URL: https://issues.apache.org/jira/browse/SPARK-5183 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Reporter: Yin Huai Priority: Blocker We need to document the data types the caller needs to support.
[jira] [Resolved] (SPARK-5568) Python API for the write support of the data source API
[ https://issues.apache.org/jira/browse/SPARK-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-5568. - Resolution: Fixed Fix Version/s: 1.3.0 It has been resolved by https://github.com/apache/spark/pull/4446. Python API for the write support of the data source API --- Key: SPARK-5568 URL: https://issues.apache.org/jira/browse/SPARK-5568 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker Fix For: 1.3.0
[jira] [Resolved] (SPARK-5706) Support inferring schema from a single json string
[ https://issues.apache.org/jira/browse/SPARK-5706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5706. -- Resolution: Duplicate Support inferring schema from a single json string -- Key: SPARK-5706 URL: https://issues.apache.org/jira/browse/SPARK-5706 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao We have noticed some developers complaining that JSON parsing is very slow, particularly schema inference. Some of them suggest providing a simple interface for inferring the schema from a single complete JSON string record instead of sampling.
[jira] [Resolved] (SPARK-4336) auto detect type from json string
[ https://issues.apache.org/jira/browse/SPARK-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4336. -- Resolution: Won't Fix Per PR discussion, resolved as Won't Fix due to concerns over speed. auto detect type from json string - Key: SPARK-4336 URL: https://issues.apache.org/jira/browse/SPARK-4336 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5724) misconfiguration in Akka system
Nan Zhu created SPARK-5724: -- Summary: misconfiguration in Akka system Key: SPARK-5724 URL: https://issues.apache.org/jira/browse/SPARK-5724 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.1.0 Reporter: Nan Zhu In AkkaUtils, we set several failure-detector-related parameters as follows:
{code:title=AkkaUtil.scala|borderStyle=solid}
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
  .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
  s"""
  |akka.daemonic = on
  |akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
  |akka.stdout-loglevel = "ERROR"
  |akka.jvm-exit-on-fatal-error = off
  |akka.remote.require-cookie = "$requireCookie"
  |akka.remote.secure-cookie = "$secureCookie"
  |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
  |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
  |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
  |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
  |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
  |akka.remote.netty.tcp.hostname = "$host"
  |akka.remote.netty.tcp.port = $port
  |akka.remote.netty.tcp.tcp-nodelay = on
  |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
  |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
  |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
  |akka.actor.default-dispatcher.throughput = $akkaBatchSize
  |akka.log-config-on-start = $logAkkaConfig
  |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
  |akka.log-dead-letters = $lifecycleEvents
  |akka.log-dead-letters-during-shutdown = $lifecycleEvents
  """.stripMargin))
{code}
Actually, there is no parameter named {{akka.remote.transport-failure-detector.threshold}} (see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html); what we have is {{akka.remote.watch-failure-detector.threshold}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
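For concreteness, a minimal sketch of the corrected key (the value is a placeholder; whether the fix simply renames the key is for the PR to decide):
{code}
import com.typesafe.config.ConfigFactory

// Akka 2.3's reference configuration defines
// akka.remote.watch-failure-detector.threshold; there is no
// akka.remote.transport-failure-detector.threshold key, so the
// original setting was silently ignored.
val akkaFailureDetector = 300.0  // placeholder value
val fixedConf = ConfigFactory.parseString(
  s"""
     |akka.remote.watch-failure-detector.threshold = $akkaFailureDetector
   """.stripMargin)
{code}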
[jira] [Resolved] (SPARK-5343) ShortestPaths traverses backwards
[ https://issues.apache.org/jira/browse/SPARK-5343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-5343. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4478 https://github.com/apache/spark/pull/4478 ShortestPaths traverses backwards - Key: SPARK-5343 URL: https://issues.apache.org/jira/browse/SPARK-5343 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Michael Malak Fix For: 1.3.0 GraphX ShortestPaths seems to be following edges backwards instead of forwards:
{code}
import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L, ""), (2L, ""), (3L, ""))),
              sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, ""))))

lib.ShortestPaths.run(g, Array(3)).vertices.collect
res1: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map()), (3,Map(3 -> 0)), (2,Map()))

lib.ShortestPaths.run(g, Array(1)).vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.lib.ShortestPaths.SPMap)] = Array((1,Map(1 -> 0)), (3,Map(1 -> 2)), (2,Map(1 -> 1)))
{code}
The following changes may be what is needed to make it run forward: Change one occurrence of src to dst in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L64 Change three occurrences of dst to src in https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/ShortestPaths.scala#L65 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
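To make the expected forward semantics concrete, here is the same repro with the output one would expect once traversal follows edge direction (the expected array is our reading of the fix, not taken from the ticket; {{sc}} is the spark-shell context):
{code}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths

// The graph is a simple chain 1 -> 2 -> 3.
val vertices = sc.makeRDD(Array((1L, ""), (2L, ""), (3L, "")))
val edges = sc.makeRDD(Array(Edge(1L, 2L, ""), Edge(2L, 3L, "")))
val g = Graph(vertices, edges)

// Following edges forwards, landmark 3 is two hops from vertex 1 and
// one hop from vertex 2:
ShortestPaths.run(g, Seq(3L)).vertices.collect()
// expected: Array((1,Map(3 -> 2)), (2,Map(3 -> 1)), (3,Map(3 -> 0)))
{code}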
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315087#comment-14315087 ] Andrew Ash commented on SPARK-4879: --- This is really great work [~joshrosen]! I really appreciate the effort you're putting into getting this one figured out, since these kinds of non-deterministic bugs are the most painful for both users and devs to figure out. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing.
h3. Reproduction
This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}:
{code}
// Rig a job such that all but one of the tasks complete instantly
// and one task runs for 20 seconds on its first attempt and instantly
// on its second attempt:
val numTasks = 100
sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
  if (ctx.partitionId == 0) {  // If this is the one task that should run really slow
    if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
      Thread.sleep(20 * 1000)
    }
  }
  iter
}.map(x => (x, x)).saveAsTextFile("/test4")
{code}
When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task:
{code}
[...]
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100)
14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s
14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully
scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}
One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails:
{code}
if (fs.isFile(taskOutput)) {
  Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput,
[jira] [Resolved] (SPARK-5021) GaussianMixtureEM should be faster for SparseVector input
[ https://issues.apache.org/jira/browse/SPARK-5021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5021. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4459 [https://github.com/apache/spark/pull/4459] GaussianMixtureEM should be faster for SparseVector input - Key: SPARK-5021 URL: https://issues.apache.org/jira/browse/SPARK-5021 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Manoj Kumar Fix For: 1.3.0 GaussianMixtureEM currently converts everything to dense vectors. It would be nice if it were faster for SparseVectors (running in time linear in the number of non-zero values). However, this may not be too important since clustering should rarely be done in high dimensions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315077#comment-14315077 ] Josh Rosen commented on SPARK-4879: --- This issue is _really_ hard to reproduce, but I managed to trigger the original bug as part of the testing for my patch. Here's what I ran:
{code}
~/spark-1.3.0-SNAPSHOT-bin-1.0.4/bin/spark-shell --conf spark.speculation.multiplier=1 --conf spark.speculation.quantile=0.01 --conf spark.speculation=true --conf spark.hadoop.outputCommitCoordination.enabled=false
{code}
{code}
val numTasks = 100
val numTrials = 100
val outputPath = "/output-committer-bug-"
val sleepDuration = 1000
for (trial <- 0 to (numTrials - 1)) {
  val outputLocation = outputPath + trial
  sc.parallelize(1 to numTasks, numTasks).mapPartitionsWithContext { case (ctx, iter) =>
    if (ctx.partitionId % 5 == 0) {
      if (ctx.attemptNumber == 0) {
        // If this is the first attempt, run slow
        Thread.sleep(sleepDuration)
      }
    }
    iter
  }.map(identity).saveAsTextFile(outputLocation)
  Thread.sleep(sleepDuration * 2)
  println("TESTING OUTPUT OF TRIAL " + trial)
  val savedData = sc.textFile(outputLocation).map(_.toInt).collect()
  if (savedData.toSet != (1 to numTasks).toSet) {
    println("MISSING: " + ((1 to numTasks).toSet -- savedData.toSet))
    assert(false)
  }
  println("-" * 80)
}
{code}
It took 22 runs until I actually observed missing output partitions (several of the earlier runs threw spurious exceptions and didn't have missing outputs):
{code}
[...]
15/02/10 22:17:21 INFO scheduler.DAGScheduler: Job 66 finished: saveAsTextFile at <console>:39, took 2.479592 s
15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 75.0 in stage 66.0 (TID 6861, ip-172-31-1-124.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201502102217_0066_m_75_6861
at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
at org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
at org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
at org.apache.spark.SparkHadoopWriter.performCommit$1(SparkHadoopWriter.scala:113)
at org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:150)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1082)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/10 22:17:21 WARN scheduler.TaskSetManager: Lost task 80.0 in stage 66.0 (TID 6866, ip-172-31-11-151.us-west-2.compute.internal): java.io.IOException: The temporary job-output directory hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary doesn't exist!
at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1059)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 85.0 in stage 66.0 (TID 6871) on executor ip-172-31-1-124.us-west-2.compute.internal: java.io.IOException (The temporary job-output directory hdfs://ec2-54-213-142-80.us-west-2.compute.amazonaws.com:9000/output-committer-bug-22/_temporary doesn't exist!) [duplicate 1]
15/02/10 22:17:21 INFO scheduler.TaskSetManager: Lost task 90.0 in stage 66.0 (TID 6876) on executor ip-172-31-1-124.us-west-2.compute.internal: java.io.IOException (The temporary job-output directory
[jira] [Commented] (SPARK-5724) misconfiguration in Akka system
[ https://issues.apache.org/jira/browse/SPARK-5724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315135#comment-14315135 ] Apache Spark commented on SPARK-5724: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/4512 misconfiguration in Akka system --- Key: SPARK-5724 URL: https://issues.apache.org/jira/browse/SPARK-5724 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Nan Zhu In AkkaUtils, we set several failure-detector-related parameters as follows:
{code:title=AkkaUtil.scala|borderStyle=solid}
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
  .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
  s"""
  |akka.daemonic = on
  |akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
  |akka.stdout-loglevel = "ERROR"
  |akka.jvm-exit-on-fatal-error = off
  |akka.remote.require-cookie = "$requireCookie"
  |akka.remote.secure-cookie = "$secureCookie"
  |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
  |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
  |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
  |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
  |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
  |akka.remote.netty.tcp.hostname = "$host"
  |akka.remote.netty.tcp.port = $port
  |akka.remote.netty.tcp.tcp-nodelay = on
  |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
  |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
  |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
  |akka.actor.default-dispatcher.throughput = $akkaBatchSize
  |akka.log-config-on-start = $logAkkaConfig
  |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
  |akka.log-dead-letters = $lifecycleEvents
  |akka.log-dead-letters-during-shutdown = $lifecycleEvents
  """.stripMargin))
{code}
Actually, there is no parameter named {{akka.remote.transport-failure-detector.threshold}} (see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html); what we have is {{akka.remote.watch-failure-detector.threshold}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4261) make right version info for beeline
[ https://issues.apache.org/jira/browse/SPARK-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4261: - Priority: Trivial (was: Major) make right version info for beeline --- Key: SPARK-4261 URL: https://issues.apache.org/jira/browse/SPARK-4261 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.1.0 Reporter: wangfei Priority: Trivial Running the Spark SQL JDBC/ODBC tooling, the output is:
{noformat}
JackydeMacBook-Pro:spark1 jackylee$ bin/beeline
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Beeline version ??? by Apache Hive
{noformat}
We should show the right version info for beeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4261) make right version info for beeline
[ https://issues.apache.org/jira/browse/SPARK-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4261. -- Resolution: Won't Fix Per PR discussion, this is a cosmetic issue, and can't really be resolved in the context of a Spark assembly since the code is looking at the JAR's Manifest version, which will be Spark's at best. make right version info for beeline --- Key: SPARK-4261 URL: https://issues.apache.org/jira/browse/SPARK-4261 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.1.0 Reporter: wangfei Priority: Trivial Running the Spark SQL JDBC/ODBC tooling, the output is:
{noformat}
JackydeMacBook-Pro:spark1 jackylee$ bin/beeline
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Beeline version ??? by Apache Hive
{noformat}
We should show the right version info for beeline. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
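For reference, the lookup Beeline's banner effectively performs is roughly the following (a sketch, not Hive's actual code): read {{Implementation-Version}} from the manifest of the JAR the class was loaded from. In a fat assembly that manifest is Spark's, so the Hive version comes back null and is printed as {{???}}.
{code}
// Sketch of a manifest-based version lookup; in the Spark assembly
// the package metadata belongs to Spark, so this yields null for Hive.
val version = Option(getClass.getPackage)
  .flatMap(p => Option(p.getImplementationVersion))
  .getOrElse("???")
println(s"Beeline version $version by Apache Hive")
{code}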
[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Twinkle Sachdeva updated SPARK-4705: Attachment: Screen Shot 2015-02-10 at 6.27.49 pm.png UI-2 Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because:
{noformat}
Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
{noformat}
The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5718) Change to native offset management for ReliableKafkaReceiver
Saisai Shao created SPARK-5718: -- Summary: Change to native offset management for ReliableKafkaReceiver Key: SPARK-5718 URL: https://issues.apache.org/jira/browse/SPARK-5718 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Kafka 0.8.2 supports native offset management instead of ZK, which gives better performance. For now ReliableKafkaReceiver relies on ZK to manage offsets, which can potentially become a bottleneck if the commit rate is high (once per 200 ms by default). So, in order to get better performance as well as to stay consistent with Kafka, add native offset management to ReliableKafkaReceiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
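On the consumer side, the Kafka 0.8.2 switches involved look like the following (the configuration keys are from the Kafka 0.8.2 documentation; how the receiver would expose them is an open design question for this ticket):
{code}
import java.util.Properties

// Kafka 0.8.2 high-level consumer settings for native offset storage:
// offsets are committed to Kafka itself rather than to ZooKeeper.
val props = new Properties()
props.put("group.id", "spark-reliable-receiver")
props.put("zookeeper.connect", "zk-host:2181")
props.put("offsets.storage", "kafka")      // store offsets in Kafka, not ZK
props.put("dual.commit.enabled", "false")  // don't also mirror commits to ZK
{code}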
[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Twinkle Sachdeva updated SPARK-4705: Attachment: Screen Shot 2015-02-10 at 6.27.49 pm.png UI - 2 Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, multi-attempts with no attempt based UI.png yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because:
{noformat}
Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
{noformat}
The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14314140#comment-14314140 ] Twinkle Sachdeva commented on SPARK-4705: - Hi, here is the final approach I have taken regarding the UI. If there is no application whose events are logged per attempt, then the previous UI will continue to appear. As soon as there is one or more applications whose events have been logged per attempt (even if there is only one attempt), the UI will change to the per-attempt UI (please see the attachment). By logging per attempt, I mean the changed folder structure. Please note that in the case of no attempt-specific UI, the anchor was on the application id value. In the new UI (UI - 2), the anchor will appear on the attempt ID. Thanks, Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: Screen Shot 2015-02-10 at 6.27.49 pm.png, multi-attempts with no attempt based UI.png yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because:
{noformat}
Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
{noformat}
The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Twinkle Sachdeva updated SPARK-4705: Attachment: (was: multi-attempts with no attempt based UI.png) Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because:
{noformat}
Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
{noformat}
The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4705) Driver retries in yarn-cluster mode always fail if event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-4705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Twinkle Sachdeva updated SPARK-4705: Attachment: (was: Screen Shot 2015-02-10 at 6.27.49 pm.png) Driver retries in yarn-cluster mode always fail if event logging is enabled --- Key: SPARK-4705 URL: https://issues.apache.org/jira/browse/SPARK-4705 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Attachments: multi-attempts with no attempt based UI.png yarn-cluster mode will retry to run the driver in certain failure modes. If event logging is enabled, this will most probably fail, because:
{noformat}
Exception in thread "Driver" java.io.IOException: Log directory hdfs://vanzin-krb-1.vpc.cloudera.com:8020/user/spark/applicationHistory/application_1417554558066_0003 already exists!
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:129)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:353)
{noformat}
The event log path should be more unique. Or perhaps retries of the same app should clean up the old logs first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5726) Hadamard Vector Product Transformer
Octavian Geagla created SPARK-5726: -- Summary: Hadamard Vector Product Transformer Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is already implemented, documented, and tested. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious whether there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there aren't any examples that use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
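For readers unfamiliar with the proposal, the core operation is a component-wise (Hadamard) product against a fixed scaling vector. A minimal sketch, assuming dense vectors (the class name and constructor here are hypothetical; see the linked branch for the actual draft):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical component-wise scaling transformer: multiplies each
// element of the input vector by the corresponding element of a
// fixed scaling vector.
class HadamardProduct(val scalingVector: Vector) {
  def transform(v: Vector): Vector = {
    require(v.size == scalingVector.size, "dimension mismatch")
    val w = scalingVector.toArray
    Vectors.dense(v.toArray.zip(w).map { case (x, s) => x * s })
  }
}

// Usage: keep the first component, zero the second, double the third.
val t = new HadamardProduct(Vectors.dense(1.0, 0.0, 2.0))
t.transform(Vectors.dense(3.0, 4.0, 5.0)) // => [3.0, 0.0, 10.0]
{code}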
[jira] [Created] (SPARK-5725) ParquetRelation2.equals throws when compared with non-Parquet relations
Cheng Lian created SPARK-5725: - Summary: ParquetRelation2.equals throws when compared with non-Parquet relations Key: SPARK-5725 URL: https://issues.apache.org/jira/browse/SPARK-5725 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian It's an apparent mistake: the method [forgets to return {{false}} in other cases|https://github.com/apache/spark/blob/5820961289eb98e45eb467efa316c7592b8d619c/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L150-L155]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
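The bug shape is a pattern-match {{equals}} with no catch-all arm, which throws a {{MatchError}} for any other argument type. A simplified, self-contained illustration (not the actual Spark code):
{code}
// Simplified illustration only: an equals written as a pattern match
// needs a catch-all arm, otherwise comparing against any other type
// throws scala.MatchError instead of returning false.
class ParquetLike(val path: String) {
  override def equals(other: Any): Boolean = other match {
    case that: ParquetLike => path == that.path
    case _ => false // the missing arm the fix adds
  }
  override def hashCode(): Int = path.hashCode
}
{code}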
[jira] [Commented] (SPARK-5725) ParquetRelation2.equals throws when compared with non-Parquet relations
[ https://issues.apache.org/jira/browse/SPARK-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315155#comment-14315155 ] Apache Spark commented on SPARK-5725: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4513 ParquetRelation2.equals throws when compared with non-Parquet relations --- Key: SPARK-5725 URL: https://issues.apache.org/jira/browse/SPARK-5725 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian It's an apparent mistake: the method [forgets to return {{false}} in other cases|https://github.com/apache/spark/blob/5820961289eb98e45eb467efa316c7592b8d619c/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L150-L155]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4682) Consolidate various 'Clock' classes
[ https://issues.apache.org/jira/browse/SPARK-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315162#comment-14315162 ] Apache Spark commented on SPARK-4682: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4514 Consolidate various 'Clock' classes --- Key: SPARK-4682 URL: https://issues.apache.org/jira/browse/SPARK-4682 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Josh Rosen Spark currently has at least four different {{Clock}} classes for mocking out wall-clock time, most of which are nearly identical. We should replace all of these with one Clock class that lives in the utilities package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
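A consolidated version would presumably look something like the following sketch (the names are assumptions, not necessarily what the PR settled on): one trait, a real implementation, and a manually advanced one for deterministic tests.
{code}
// Sketch of a single consolidated Clock for the utilities package.
trait Clock {
  def getTimeMillis(): Long
}

// Production implementation backed by the system clock.
class SystemClock extends Clock {
  override def getTimeMillis(): Long = System.currentTimeMillis()
}

// Test implementation whose time only moves when advanced explicitly.
class ManualClock(private var time: Long = 0L) extends Clock {
  override def getTimeMillis(): Long = synchronized { time }
  def advance(ms: Long): Unit = synchronized { time += ms }
}
{code}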
[jira] [Resolved] (SPARK-5644) Delete tmp dir when sc is stop
[ https://issues.apache.org/jira/browse/SPARK-5644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5644. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4412 [https://github.com/apache/spark/pull/4412] Delete tmp dir when sc is stop -- Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Priority: Minor Fix For: 1.4.0 When we run the driver as a long-lived service, we create a SparkContext in the service process, run a job, and then stop the context. Because we only call sc.stop and never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs if we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
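The usage pattern being described is roughly the following (a sketch; the fix ties the tmp-dir cleanup to {{sc.stop()}} instead of to a JVM shutdown hook that never fires in a long-lived service):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Long-lived service: the JVM never exits, so cleanup that waits for
// JVM shutdown never runs. Before the fix, every iteration leaked the
// tmp dirs created by HttpFileServer and SparkEnv.
while (true) {
  val sc = new SparkContext(new SparkConf().setAppName("service-job"))
  // ... run one job ...
  sc.stop() // with the fix, the context's tmp dirs are deleted here
}
{code}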
[jira] [Commented] (SPARK-5727) Deprecate, remove Debian packaging
[ https://issues.apache.org/jira/browse/SPARK-5727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315294#comment-14315294 ] Apache Spark commented on SPARK-5727: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4516 Deprecate, remove Debian packaging -- Key: SPARK-5727 URL: https://issues.apache.org/jira/browse/SPARK-5727 Project: Spark Issue Type: Task Components: Build, Deploy Affects Versions: 1.2.1 Reporter: Sean Owen Assignee: Sean Owen Per discussion on the mailing list (https://www.mail-archive.com/dev@spark.apache.org/msg07598.html), this JIRA proposes: - For 1.3.x, deprecate the Debian packaging (the {{deb}} profile) by adding a warning message of some kind when invoking the profile - For 1.4.x, remove the packaging Two PRs coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308620#comment-14308620 ] Kevin Jung edited comment on SPARK-5081 at 2/11/15 12:56 AM: - To test under the same conditions, I set this to snappy for all Spark versions, but the problem still occurs. AFAIK, lz4 needs more CPU time than snappy but it has a better compression ratio. was (Author: kallsu): To test under the same conditions, I set this to snappy for all Spark versions, but the problem still occurs. AFA I know, lz4 needs more CPU time than snappy but it has a better compression ratio. Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung The size of shuffle write shown in the Spark web UI is much different when I execute the same Spark job with the same input data in both Spark 1.1 and Spark 1.2. At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 146.9MB in Spark 1.2. I set the spark.shuffle.manager option to hash because its default value changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger, and it causes jobs to take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
spark 1.1
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|9|saveAsTextFile| |1169.4KB| |
|12|combineByKey| |1265.4KB|1275.0KB|
|6|sortByKey| |1276.5KB| |
|8|mapPartitions| |91.0MB|1383.1KB|
|4|apply| |89.4MB| |
|5|sortBy|155.6MB| |98.1MB|
|3|sortBy|155.6MB| | |
|1|collect| |2.1MB| |
|2|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
spark 1.2
||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
|12|saveAsTextFile| |1170.2KB| |
|11|combineByKey| |1264.5KB|1275.0KB|
|8|sortByKey| |1273.6KB| |
|7|mapPartitions| |134.5MB|1383.1KB|
|5|zipWithIndex| |132.5MB| |
|4|sortBy|155.6MB| |146.9MB|
|3|sortBy|155.6MB| | |
|2|collect| |2.0MB| |
|1|mapValues|155.6MB| |2.2MB|
|0|first|184.4KB| | |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
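For reference, the test setup described (pinning the codec, plus the shuffle manager mentioned in the report) would look like the following; this is a sketch of that configuration, not taken from the ticket:
{code}
import org.apache.spark.SparkConf

// Hold the compression codec and shuffle manager fixed so Spark 1.1
// and 1.2 runs are directly comparable (1.2 changed the default
// shuffle manager from hash to sort).
val conf = new SparkConf()
  .set("spark.io.compression.codec", "snappy")
  .set("spark.shuffle.manager", "hash")
{code}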
[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
[ https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-5243: -- Description: Spark will hang if calling spark-submit under the conditions:
1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster
This usually happens during development for beginners. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I would like to know your opinions on whether a fix is needed and on better fix options. was: Spark will hang if calling spark-submit under the conditions:
1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster
This usually happens during development for beginners. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I am preparing PR for the case. And I would like to know your opinions on whether a fix is needed and on better fix options. Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang if calling spark-submit under the conditions:
1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster
This usually happens during development for beginners. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I would like to know your opinions on whether a fix is needed and on better fix options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
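A sketch of the kind of submit-time warning being proposed (hypothetical code, not part of Spark):
{code}
// Hypothetical pre-submit check: warn when a cluster-mode driver plus
// an executor cannot both fit on the single worker, since the app
// would otherwise wait for resources forever.
def warnIfUnschedulable(driverMemMb: Long, executorMemMb: Long,
                        workerMemMb: Long): Unit = {
  if (driverMemMb + executorMemMb > workerMemMb) {
    System.err.println(
      s"Warning: driver ($driverMemMb MB) + executor ($executorMemMb MB) " +
      s"exceed worker memory ($workerMemMb MB); the application may hang " +
      "waiting for resources.")
  }
}
{code}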
[jira] [Issue Comment Deleted] (SPARK-5682) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liyunzhang_intel updated SPARK-5682: Comment: was deleted (was: encrypted_shuffle.patch.4 shows how to reuse the Hadoop encrypted classes to enable Spark encrypted shuffle. How to use: {{patch -p1 < encrypted_shuffle.patch.4}}) Reuse hadoop encrypted shuffle algorithm to enable spark encrypted shuffle -- Key: SPARK-5682 URL: https://issues.apache.org/jira/browse/SPARK-5682 Project: Spark Issue Type: New Feature Components: Shuffle Reporter: liyunzhang_intel Attachments: Design Document of Encrypted Spark Shuffle_20150209.docx Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data safer. This feature is necessary in Spark. We reuse the Hadoop encrypted shuffle feature in Spark, and because UGI credential info is necessary for encrypted shuffle, we first enable encrypted shuffle on the spark-on-yarn framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
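As a purely conceptual sketch of the reuse idea, Hadoop 2.6's crypto streams can wrap an arbitrary {{OutputStream}} (class names are from {{org.apache.hadoop.crypto}}; the key/IV distribution via UGI credentials described in the design document is omitted, and the key/IV sizes assume the default AES/CTR codec):
{code}
import java.io.FileOutputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.crypto.{CryptoCodec, CryptoOutputStream}

// Wrap a plain output stream so everything written to it is encrypted.
// Key and IV are zeroed placeholders here; real code must obtain them
// from the job's credentials.
val conf = new Configuration()
val codec = CryptoCodec.getInstance(conf) // default: AES/CTR/NoPadding
val key = new Array[Byte](16)
val iv = new Array[Byte](16)
val out = new CryptoOutputStream(
  new FileOutputStream("/tmp/shuffle.enc"), codec, 4096, key, iv)
out.write("shuffle bytes".getBytes("UTF-8"))
out.close()
{code}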
[jira] [Updated] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
[ https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-5243: -- Description: Spark will hang if calling spark-submit under the conditions:
1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster
This usually happens during development for beginners. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I would like to know your opinions on whether a fix is needed (is this by design?) and on better fix options. was: Spark will hang if calling spark-submit under the conditions:
1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster
This usually happens during development for beginners. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I would like to know your opinions on whether a fix is needed and on better fix options. Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang if calling spark-submit under the conditions:
1. the cluster has only one worker.
2. driver memory + executor memory > worker memory
3. deploy-mode = cluster
This usually happens during development for beginners. There should be some exit mechanism, or at least a warning message, in the output of spark-submit. I would like to know your opinions on whether a fix is needed (is this by design?) and on better fix options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5725) ParquetRelation2.equals throws when compared with non-Parquet relations
[ https://issues.apache.org/jira/browse/SPARK-5725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-5725. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4513 [https://github.com/apache/spark/pull/4513] ParquetRelation2.equals throws when compared with non-Parquet relations --- Key: SPARK-5725 URL: https://issues.apache.org/jira/browse/SPARK-5725 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 It's an apparent mistake: the method [forgets to return {{false}} in other cases|https://github.com/apache/spark/blob/5820961289eb98e45eb467efa316c7592b8d619c/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala#L150-L155]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5718) Add native offset management for ReliableKafkaReceiver
[ https://issues.apache.org/jira/browse/SPARK-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-5718: --- Summary: Add native offset management for ReliableKafkaReceiver (was: Change to native offset management for ReliableKafkaReceiver) Add native offset management for ReliableKafkaReceiver -- Key: SPARK-5718 URL: https://issues.apache.org/jira/browse/SPARK-5718 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.3.0 Reporter: Saisai Shao Kafka 0.8.2 supports native offset management instead of ZK, which gives better performance. For now ReliableKafkaReceiver relies on ZK to manage offsets, which can potentially become a bottleneck if the commit rate is high (once per 200 ms by default). So, in order to get better performance as well as to stay consistent with Kafka, add native offset management to ReliableKafkaReceiver. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org