[jira] [Created] (SPARK-3611) Show number of cores for each executor in application web UI
Matei Zaharia created SPARK-3611: Summary: Show number of cores for each executor in application web UI Key: SPARK-3611 URL: https://issues.apache.org/jira/browse/SPARK-3611 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Matei Zaharia Priority: Minor This number is not always fully known, because e.g. in Mesos your executors can scale up and down in # of CPUs, but it would be nice to show at least the number of cores the machine has in that case, or the # of cores the executor has been configured with if known. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver
Reynold Xin created SPARK-3612: -- Summary: Executor shouldn't quit if heartbeat message fails to reach the driver Key: SPARK-3612 URL: https://issues.apache.org/jira/browse/SPARK-3612 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Reynold Xin The thread started by Executor.startDriverHeartbeater can actually terminate the whole executor if AkkaUtils.askWithReply[HeartbeatResponse] throws an exception. I don't think we should quit the executor this way. At the very least, we would want to log a more meaningful exception then simply {code} 14/09/20 06:38:12 WARN AkkaUtils: Error sending message in 1 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) 14/09/20 06:38:45 WARN AkkaUtils: Error sending message in 2 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) 14/09/20 06:39:18 WARN AkkaUtils: Error sending message in 3 attempts java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) 14/09/20 06:39:21 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Driver Heartbeater,5,main] org.apache.spark.SparkException: Error sending message [message = Heartbeat(281,[Lscala.Tuple2;@4d9294db,BlockManagerId(281, ip-172-31-7-55.eu-west-1.compute.internal, 52303))] at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:190) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:379) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:176) ... 
1 more {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
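The fix being asked for amounts to catching the failure inside the heartbeat loop rather than letting it escape and kill the thread. A minimal sketch of that shape (reportHeartbeat and heartbeatIntervalMs are hypothetical stand-ins, not the real Executor internals):
{code}
import scala.util.control.NonFatal

object HeartbeaterSketch {
  // Hypothetical stand-ins for the real Executor internals.
  val heartbeatIntervalMs = 10000L
  def reportHeartbeat(): Unit = {
    // e.g. AkkaUtils.askWithReply[HeartbeatResponse](...) in the real code
  }

  val heartbeater = new Thread("driver-heartbeater") {
    override def run(): Unit = {
      while (!Thread.currentThread().isInterrupted) {
        try {
          reportHeartbeat()
        } catch {
          // Log and keep going: a missed heartbeat should not terminate the executor.
          case NonFatal(e) =>
            println(s"WARN issue communicating with driver in heartbeater: $e")
        }
        Thread.sleep(heartbeatIntervalMs)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    heartbeater.setDaemon(true)
    heartbeater.start()
  }
}
{code}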
[jira] [Commented] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver
[ https://issues.apache.org/jira/browse/SPARK-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141807#comment-14141807 ] Reynold Xin commented on SPARK-3612: [~andrewor14] [~sandyryza] any comment on this? I think you guys worked on this code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3613) Don't record the size of each shuffle block for large jobs
Reynold Xin created SPARK-3613: -- Summary: Don't record the size of each shuffle block for large jobs Key: SPARK-3613 URL: https://issues.apache.org/jira/browse/SPARK-3613 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin MapStatus saves the size of each block (1 byte per block) for a particular map task. This actually means the shuffle metadata is O(M*R), where M = num maps and R = num reduces. If M is greater than a certain size, we should probably just send an average size instead of a whole array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
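The O(M*R) -> O(M) idea can be sketched with a plain threshold on the number of blocks; the type names and cut-off below are illustrative only, not the actual MapStatus classes:
{code}
// Above some threshold, stop tracking one size per reduce block and keep only the
// average. Every name here is hypothetical; it only illustrates the idea.
sealed trait MapStatusSketch { def sizeFor(reduceId: Int): Long }

case class ExactSizes(sizes: Array[Long]) extends MapStatusSketch {
  def sizeFor(reduceId: Int): Long = sizes(reduceId)
}

case class AveragedSize(avg: Long, numBlocks: Int) extends MapStatusSketch {
  // Every block is reported at the average size, so only one number is stored.
  def sizeFor(reduceId: Int): Long = avg
}

object MapStatusSketch {
  val threshold = 2000  // arbitrary cut-off for this sketch

  def apply(sizes: Array[Long]): MapStatusSketch =
    if (sizes.length <= threshold) ExactSizes(sizes)
    else AveragedSize(sizes.sum / sizes.length, sizes.length)
}
{code}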
[jira] [Commented] (SPARK-3613) Don't record the size of each shuffle block for large jobs
[ https://issues.apache.org/jira/browse/SPARK-3613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14141870#comment-14141870 ] Apache Spark commented on SPARK-3613: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2470 Don't record the size of each shuffle block for large jobs -- Key: SPARK-3613 URL: https://issues.apache.org/jira/browse/SPARK-3613 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin MapStatus saves the size of each block (1 byte per block) for a particular map task. This actually means the shuffle metadata is O(M*R), where M = num maps and R = num reduces. If M is greater than a certain size, we should probably just send an average size instead of a whole array. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.
[ https://issues.apache.org/jira/browse/SPARK-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3608: --- Assignee: Vida Ha Spark EC2 Script does not correctly break when AWS tagging succeeds. Key: SPARK-3608 URL: https://issues.apache.org/jira/browse/SPARK-3608 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Vida Ha Assignee: Vida Ha Priority: Critical Fix For: 1.2.0 Spark EC2 script will tag 5 times and not break out correctly if things succeed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3608) Spark EC2 Script does not correctly break when AWS tagging succeeds.
[ https://issues.apache.org/jira/browse/SPARK-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-3608. Resolution: Fixed Fix Version/s: 1.2.0 Target Version/s: 1.2.0 Spark EC2 Script does not correctly break when AWS tagging succeeds. Key: SPARK-3608 URL: https://issues.apache.org/jira/browse/SPARK-3608 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.1.0 Reporter: Vida Ha Assignee: Vida Ha Priority: Critical Fix For: 1.2.0 Spark EC2 script will tag 5 times and not break out correctly if things succeed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3562) Periodic cleanup event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141876#comment-14141876 ] Apache Spark commented on SPARK-3562: - User 'viper-kun' has created a pull request for this issue: https://github.com/apache/spark/pull/2471 Periodic cleanup event logs --- Key: SPARK-3562 URL: https://issues.apache.org/jira/browse/SPARK-3562 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: xukun If we run Spark applications frequently, many event logs are written into spark.eventLog.dir. Over time the directory accumulates event logs that we no longer care about. Periodic cleanup would ensure that logs older than a configured retention period are forgotten, so there is no need to clean them up by hand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
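A sketch of what such a cleaner could look like (local files and hypothetical retention settings; the real event log directory may live on HDFS and would go through the Hadoop FileSystem API instead):
{code}
import java.io.File
import java.util.concurrent.{Executors, TimeUnit}

object EventLogCleanerSketch {
  // Assumed retention settings for this sketch only.
  val maxAgeMs   = TimeUnit.DAYS.toMillis(7)
  val intervalMs = TimeUnit.DAYS.toMillis(1)

  // Delete any event log file older than the retention period.
  def cleanOnce(eventLogDir: File): Unit = {
    val cutoff = System.currentTimeMillis() - maxAgeMs
    Option(eventLogDir.listFiles()).getOrElse(Array.empty[File])
      .filter(f => f.isFile && f.lastModified() < cutoff)
      .foreach(_.delete())
  }

  // Run the cleanup on a fixed schedule from a single background thread.
  def schedule(eventLogDir: File): Unit = {
    val pool = Executors.newSingleThreadScheduledExecutor()
    pool.scheduleAtFixedRate(
      new Runnable { def run(): Unit = cleanOnce(eventLogDir) },
      0L, intervalMs, TimeUnit.MILLISECONDS)
  }
}
{code}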
[jira] [Commented] (SPARK-3612) Executor shouldn't quit if heartbeat message fails to reach the driver
[ https://issues.apache.org/jira/browse/SPARK-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142006#comment-14142006 ] Sandy Ryza commented on SPARK-3612: --- Yeah, we should catch this. Will post a patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jatinpreet Singh updated SPARK-3614: Description: The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) =minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. was: The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) =minimumOccurance where, |D| is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. Filter on minimum occurrences of a term in IDF --- Key: SPARK-3614 URL: https://issues.apache.org/jira/browse/SPARK-3614 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Labels: TFIDF The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) =minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jatinpreet Singh updated SPARK-3614: Description: The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) =minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. was: The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=(log|D|+1) / (DF(t,D)+1), for DF(t,D) =minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. Filter on minimum occurrences of a term in IDF --- Key: SPARK-3614 URL: https://issues.apache.org/jira/browse/SPARK-3614 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Labels: TFIDF The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) =minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jatinpreet Singh updated SPARK-3614: Description: The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contain the term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. was: The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contains term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. Filter on minimum occurrences of a term in IDF --- Key: SPARK-3614 URL: https://issues.apache.org/jira/browse/SPARK-3614 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Labels: TFIDF The IDF class in MLlib does not provide the capability of defining a minimum number of documents a term should appear in the corpus. The idea is to have a cutoff variable which defines this minimum occurrence value, and the terms which have lower frequency are ignored. Mathematically, IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >= minimumOccurance where, D is the total number of documents in the corpus DF(t,D) is the number of documents that contain the term t minimumOccurance is the minimum number of documents the term appears in the document corpus This would have an impact on accuracy as terms that appear in less than a certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
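The proposed cutoff is easy to state over precomputed document frequencies. A small sketch in plain Scala (not the MLlib IDF class; the minDocFreq name and the zero weight for filtered terms are assumptions):
{code}
// Terms whose document frequency is below minDocFreq get weight 0, i.e. they are
// effectively ignored in the TF-IDF vectors; all names here are illustrative.
def idfWithCutoff(docFreqs: Array[Long], numDocs: Long, minDocFreq: Int): Array[Double] =
  docFreqs.map { df =>
    if (df >= minDocFreq) math.log((numDocs + 1.0) / (df + 1.0)) else 0.0
  }

// Example: with 100 documents, the term seen in only 1 document is zeroed out
// when minDocFreq = 2, while the other two keep their usual IDF weights.
val weights = idfWithCutoff(Array(50L, 10L, 1L), numDocs = 100L, minDocFreq = 2)
{code}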
[jira] [Created] (SPARK-3615) Kafka test should not hard code Zookeeper port
Patrick Wendell created SPARK-3615: -- Summary: Kafka test should not hard code Zookeeper port Key: SPARK-3615 URL: https://issues.apache.org/jira/browse/SPARK-3615 Project: Spark Issue Type: Bug Components: Streaming Reporter: Patrick Wendell Assignee: Saisai Shao This is causing failures in our master build if port 2181 is contended. Instead of binding to a static port, we should refactor this so that it opens a socket on port 0 and then reads back the port, so we can never have contention. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
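The port-0 trick reads roughly as follows (a sketch only; wiring the chosen port into the embedded Zookeeper/Kafka fixture is omitted):
{code}
import java.net.ServerSocket

// Ask the OS for any free ephemeral port and read it back, instead of assuming 2181.
def findFreePort(): Int = {
  val socket = new ServerSocket(0)  // port 0 means "pick a free port for me"
  try socket.getLocalPort
  finally socket.close()
}

val zkPort    = findFreePort()
val zkConnect = s"localhost:$zkPort"  // hand this to the test's Zookeeper/Kafka setup
{code}
There is still a small race between closing the probe socket and the test server binding the port; binding the test Zookeeper directly to port 0 and reading back the bound port, where its API allows it, avoids even that.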
[jira] [Updated] (SPARK-3610) Unable to load app logs for MLLib programs in history server
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Priority: Critical (was: Major) Unable to load app logs for MLLib programs in history server - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Fix For: 1.1.0 The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Description: Right now we don't have a Original bug report: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? was: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Fix For: 1.1.0 Right now we don't have a Original bug report: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Description: Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? {quote} was: Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Fix For: 1.1.0 Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? 
{quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Description: Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? was: Right now we don't have a Original bug report: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Fix For: 1.1.0 Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3610: -- Component/s: Web UI History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Fix Version/s: (was: 1.1.0) History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3610: --- Component/s: (was: Web UI) Target Version/s: 1.2.0 History server log name should not be based on user input - Key: SPARK-3610 URL: https://issues.apache.org/jira/browse/SPARK-3610 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: SK Priority: Critical Right now we use the user-defined application name when creating the logging file for the history server. We should use some type of GUID generated from inside of Spark instead of allowing user input here. It can cause errors if users provide characters that are not valid in filesystem paths. Original bug report: {quote} The default log files for the Mllib examples use a rather long naming convention that includes special characters like parentheses and comma.For e.g. one of my log files is named binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032. When I click on the program on the history server page (at port 18080), to view the detailed application logs, the history server crashes and I need to restart it. I am using Spark 1.1 on a mesos cluster. I renamed the log file by removing the special characters and then it loads up correctly. I am not sure which program is creating the log files. Can it be changed so that the default log file naming convention does not include special characters? {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
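Two illustrative directions for the fix, shown as standalone helpers rather than the eventual Spark change: derive the log name from a generated id, or at least sanitize the user-supplied application name before using it as a path component.
{code}
import java.util.UUID

// Option 1 (what the ticket suggests): base the log name on a Spark-generated id.
def logNameFromId(): String =
  s"app-${System.currentTimeMillis()}-${UUID.randomUUID().toString}"

// Option 2: keep the app name but strip anything unsafe for filesystem paths.
def sanitizedLogName(appName: String): String =
  appName.toLowerCase.replaceAll("[^a-z0-9._-]", "-") + "-" + System.currentTimeMillis()

// The problematic name from the report becomes a safe path component:
val safe = sanitizedLogName("binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)")
{code}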
[jira] [Resolved] (SPARK-3609) Add sizeInBytes statistics to Limit operator
[ https://issues.apache.org/jira/browse/SPARK-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3609. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2468 [https://github.com/apache/spark/pull/2468] Add sizeInBytes statistics to Limit operator Key: SPARK-3609 URL: https://issues.apache.org/jira/browse/SPARK-3609 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.2.0 The {{sizeInBytes}} statistics of a {{LIMIT}} operator can be estimated fairly precisely when all output attributes are of native data types, since all native data types except {{StringType}} have a fixed size. For {{StringType}}, we can use a relatively large (say 4K) default size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
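A toy version of the estimate described in the ticket, using the suggested ~4K default for strings; the type names are stand-ins, not Catalyst's data types:
{code}
// sizeInBytes(LIMIT n) ~= n * (sum of per-column default sizes); strings fall back
// to a large default because their width is not fixed. Names here are illustrative.
sealed trait ColType { def defaultSize: Int }
case object IntCol    extends ColType { val defaultSize = 4 }
case object LongCol   extends ColType { val defaultSize = 8 }
case object DoubleCol extends ColType { val defaultSize = 8 }
case object StringCol extends ColType { val defaultSize = 4096 }

def limitSizeInBytes(limit: Long, schema: Seq[ColType]): BigInt =
  BigInt(limit) * schema.map(_.defaultSize).sum

// e.g. LIMIT 5 over an (Int, String) row estimates 5 * (4 + 4096) bytes.
val estimate = limitSizeInBytes(5L, Seq(IntCol, StringCol))
{code}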
[jira] [Created] (SPARK-3617) Configurable case sensitivity
Michael Armbrust created SPARK-3617: --- Summary: Configurable case sensitivity Key: SPARK-3617 URL: https://issues.apache.org/jira/browse/SPARK-3617 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Right now SQLContext is case sensitive and HiveContext is not. It would be better to make it configurable in both instances. All of the underlying plumbing is there; we just need to expose an option. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3618) Store analyzed plans for temp tables
Michael Armbrust created SPARK-3618: --- Summary: Store analyzed plans for temp tables Key: SPARK-3618 URL: https://issues.apache.org/jira/browse/SPARK-3618 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Right now we store unanalyzed logical plans for temporary tables. However this means that changes to session state (e.g., the current database) could result in tables becoming inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2274) spark SQL query hang up sometimes
[ https://issues.apache.org/jira/browse/SPARK-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2274. - Resolution: Fixed Fix Version/s: 1.1.0 The attached query is using a left outer join, which until 1.1 was executed very slowly. Please reopen this issue if problems persist on a newer version of Spark. spark SQL query hang up sometimes - Key: SPARK-2274 URL: https://issues.apache.org/jira/browse/SPARK-2274 Project: Spark Issue Type: Question Components: SQL Environment: spark 1.0.0 Reporter: jackielihf Assignee: Michael Armbrust Fix For: 1.1.0 when I run spark SQL query, it hang up sometimes: 1) simple SQL query works, such as select * from a left out join b on a.id=b.id 2) BUT if it has more joins, such as select * from a left out join b on a.id=b.id left out join c on a.id=c.id..., spark shell seems to hang up. spark shell prints: scala hc.hql(select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5).foreach(println) 14/06/25 13:32:25 INFO ParseDriver: Parsing command: select A.id,A.tit,A.sub_tit,B.abst,B.cont,A.aut,A.com_name,A.med_name,A.pub_dt,A.upd_time,A.ent_time,A.info_lvl,A.is_pic,A.lnk_addr,A.is_ann,A.info_open_lvl,A.keyw_name,C.typ_code as type,D.evt_dir,D.evt_st,E.secu_id,E.typ_codeii,E.exch_code,E.trd_code,E.secu_sht from txt_nws_bas_update A left outer join txt_nws_bas_txt B on A.id=B.orig_id left outer join txt_nws_typ C on A.id=C.orig_id left outer join txt_nws_secu D on A.id=D.orig_id left outer join bas_secu_info E on D.secu_id=E.secu_id where D.secu_id is not null limit 5 14/06/25 13:32:25 INFO ParseDriver: Parse Completed 14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/06/25 13:32:25 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences 14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220923) called with curMem=0, maxMem=311387750 14/06/25 13:32:27 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 215.7 KB, free 296.8 MB) 14/06/25 13:32:27 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=220923, maxMem=311387750 14/06/25 13:32:27 INFO MemoryStore: Block broadcast_1 stored as values to memory (estimated size 215.8 KB, free 296.5 MB) 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=441894, maxMem=311387750 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 215.8 KB, free 296.3 MB) 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=662865, maxMem=311387750 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_3 stored as values to memory (estimated size 215.8 KB, free 296.1 MB) 14/06/25 13:32:28 INFO MemoryStore: ensureFreeSpace(220971) called with curMem=883836, maxMem=311387750 14/06/25 13:32:28 INFO MemoryStore: Block broadcast_4 stored as values to memory (estimated size 215.8 KB, free 295.9 MB) 14/06/25 13:32:29 INFO SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange 14/06/25 13:32:29 INFO 
SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions 14/06/25 13:32:30 INFO FileInputFormat: Total input paths to process : 1 14/06/25 13:32:30 INFO SparkContext: Starting job: collect at joins.scala:184 14/06/25 13:32:30 INFO DAGScheduler: Got job 0 (collect at joins.scala:184) with 2 output partitions (allowLocal=false) 14/06/25 13:32:30 INFO DAGScheduler: Final stage: Stage 0(collect at joins.scala:184) 14/06/25 13:32:30 INFO DAGScheduler: Parents of final stage: List() 14/06/25 13:32:30 INFO DAGScheduler: Missing parents: List() 14/06/25 13:32:30 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[7] at map at joins.scala:184), which has no missing parents 14/06/25 13:32:30 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[7] at map at joins.scala:184) 14/06/25 13:32:30 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks 14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 192.168.56.100 (PROCESS_LOCAL) 14/06/25 13:32:30 INFO TaskSetManager: Serialized task 0.0:0 as 4088 bytes in 3 ms 14/06/25 13:32:30 INFO TaskSetManager: Starting task 0.0:1 as TID
[jira] [Updated] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
[ https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3267: Assignee: (was: Michael Armbrust) Deadlock between ScalaReflectionLock and Data type initialization - Key: SPARK-3267 URL: https://issues.apache.org/jira/browse/SPARK-3267 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Aaron Davidson Priority: Critical Deadlock here: {code} Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000 ] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:202) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4 93) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal a:175) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:304) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4 93) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:314) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4 93) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:313) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal a:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) ... {code} and {code} Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000 ] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250) - locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations. scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations. 
scala:139) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at
[jira] [Resolved] (SPARK-3414) Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names
[ https://issues.apache.org/jira/browse/SPARK-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3414. - Resolution: Fixed Issue resolved by pull request 2382 [https://github.com/apache/spark/pull/2382] Case insensitivity breaks when unresolved relation contains attributes with uppercase letters in their names Key: SPARK-3414 URL: https://issues.apache.org/jira/browse/SPARK-3414 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Cheng Lian Assignee: Michael Armbrust Priority: Critical Fix For: 1.2.0 Paste the following snippet to {{spark-shell}} (need Hive support) to reproduce this issue: {code} import org.apache.spark.sql.hive.HiveContext val hiveContext = new HiveContext(sc) import hiveContext._ case class LogEntry(filename: String, message: String) case class LogFile(name: String) sc.makeRDD(Seq.empty[LogEntry]).registerTempTable(rawLogs) sc.makeRDD(Seq.empty[LogFile]).registerTempTable(logFiles) val srdd = sql( SELECT name, message FROM rawLogs JOIN ( SELECT name FROM logFiles ) files ON rawLogs.filename = files.name ) srdd.registerTempTable(boom) sql(select * from boom) {code} Exception thrown: {code} SchemaRDD[7] at RDD at SchemaRDD.scala:103 == Query Plan == == Physical Plan == org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: Project [*] LowerCaseSchema Subquery boom Project ['name,'message] Join Inner, Some(('rawLogs.filename = name#2)) LowerCaseSchema Subquery rawlogs SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) Subquery files Project [name#2] LowerCaseSchema Subquery logfiles SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} Notice that {{rawLogs}} in the join operator is not lowercased. The reason is that, during analysis phase, the {{CaseInsensitiveAttributeReferences}} batch is only executed before the {{Resolution}} batch. And when {{srdd}} is registered as temporary table {{boom}}, its original (unanalyzed) logical plan is stored into the catalog: {code} Join Inner, Some(('rawLogs.filename = 'files.name)) UnresolvedRelation None, rawLogs, None Subquery files Project ['name] UnresolvedRelation None, logFiles, None {code} notice that attributes referenced in the join operator (esp. {{rawLogs}}) is not lowercased yet. And then, when {{select * from boom}} is been analyzed, its input logical plan is: {code} Project [*] UnresolvedRelation None, boom, None {code} here the unresolved relation points to the unanalyzed logical plan of {{srdd}} above, which is later discovered by rule {{ResolveRelations}}, thus not touched by {{CaseInsensitiveAttributeReferences}} at all, and {{rawLogs.filename}} is thus not lowercased: {code} === Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations === Project [*]Project [*] ! UnresolvedRelation None, boom, NoneLowerCaseSchema ! Subquery boom ! Project ['name,'message] ! Join Inner, Some(('rawLogs.filename = 'files.name)) !LowerCaseSchema ! Subquery rawlogs ! SparkLogicalPlan (ExistingRdd [filename#0,message#1], MapPartitionsRDD[1] at mapPartitions at basicOperators.scala:208) !Subquery files ! Project ['name] ! LowerCaseSchema ! 
Subquery logfiles !SparkLogicalPlan (ExistingRdd [name#2], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:208) {code} A reasonable fix for this could be to always register the analyzed logical plan in the catalog when registering temporary tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
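The proposed fix maps onto the temp-table registration path: analyze the plan before storing it, so later lookups through {{ResolveRelations}} already see lowercased attribute references. The sketch below illustrates only that idea; the type names, the catalog map, and the toy analyze step are stand-ins, not Spark's actual Catalyst API.
{code}
// Hypothetical sketch: store the *analyzed* plan when a temp table is
// registered, so rules such as CaseInsensitiveAttributeReferences have run
// before the plan is looked up again. Names are stand-ins, not Spark's API.
object TempTableCatalogSketch {
  sealed trait LogicalPlan
  case class UnresolvedRelation(table: String) extends LogicalPlan
  case class AnalyzedRelation(table: String) extends LogicalPlan

  private val catalog = scala.collection.mutable.Map.empty[String, LogicalPlan]

  // Stand-in for the analyzer batches (case normalization, resolution, ...).
  def analyze(plan: LogicalPlan): LogicalPlan = plan match {
    case UnresolvedRelation(t) => AnalyzedRelation(t.toLowerCase)
    case other                 => other
  }

  // The fix: analyze first, then register, instead of storing the raw plan.
  def registerTempTable(name: String, plan: LogicalPlan): Unit =
    catalog(name.toLowerCase) = analyze(plan)

  def lookupRelation(name: String): Option[LogicalPlan] =
    catalog.get(name.toLowerCase)
}
{code}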
[jira] [Created] (SPARK-3619) Upgrade to Mesos 0.21 to work around MESOS-1688
Matei Zaharia created SPARK-3619: Summary: Upgrade to Mesos 0.21 to work around MESOS-1688 Key: SPARK-3619 URL: https://issues.apache.org/jira/browse/SPARK-3619 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Matei Zaharia When Mesos 0.21 comes out, it will have a fix for https://issues.apache.org/jira/browse/MESOS-1688, which affects Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3616) Add Selenium tests to Web UI
[ https://issues.apache.org/jira/browse/SPARK-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142244#comment-14142244 ] Apache Spark commented on SPARK-3616: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/2474 Add Selenium tests to Web UI Key: SPARK-3616 URL: https://issues.apache.org/jira/browse/SPARK-3616 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Josh Rosen Assignee: Josh Rosen We should add basic Selenium tests to Web UI suite. This will make it easy to write regression tests / reproductions for UI bugs and will be useful in testing some planned refactorings / redesigns that I'm working on. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
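For readers unfamiliar with the approach, a Selenium-backed UI test in a ScalaTest suite can be quite small. The sketch below is illustrative only and is not the pull request's actual suite; the URL, port, and title assertion are assumptions, and it presumes a Spark application UI is already running locally.
{code}
// Illustrative sketch: drive a locally running application web UI with
// ScalaTest's Selenium DSL and a headless HtmlUnit driver.
import org.openqa.selenium.WebDriver
import org.openqa.selenium.htmlunit.HtmlUnitDriver
import org.scalatest.FunSuite
import org.scalatest.selenium.WebBrowser

class UiSmokeSuite extends FunSuite with WebBrowser {
  implicit val webDriver: WebDriver = new HtmlUnitDriver

  test("stages page renders") {
    go to "http://localhost:4040/stages/"   // default application UI port
    assert(pageTitle.toLowerCase.contains("stages"))
    quit()
  }
}
{code}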
[jira] [Commented] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142257#comment-14142257 ] Patrick Wendell commented on SPARK-3604: After looking at the PR - I think the issue here is just to use the built in SparkContext.union utility when unioning large numbers of RDD's. I misread this originally as a bug report, but I think it's just a usage issue. Perhaps we should document this in the normal union() scaladoc. unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD -- Key: SPARK-3604 URL: https://issues.apache.org/jira/browse/SPARK-3604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: linux. Used python, but error is in Scala land. Reporter: Eric Friedman Priority: Critical I have a large number of parquet files all with the same schema and attempted to make a UnionRDD out of them. When I call getNumPartitions(), I get a stack overflow error that looks like this: Py4JJavaError: An error occurred while calling o3275.partitions. : java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
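To make the usage point concrete: folding many RDDs together pair-wise nests one UnionRDD inside another, and computing partitions then recurses once per level, while SparkContext.union builds a single flat UnionRDD over all inputs. A small sketch, assuming a spark-shell {{sc}}; plain parallelized collections stand in for the reporter's parquet inputs.
{code}
// Both variants yield the same data; the difference is the depth of the RDD graph.
val rdds = (1 to 5000).map(i => sc.parallelize(Seq(i)))

// Pattern that triggers the report: each union() wraps the previous result,
// so computing partitions recurses roughly once per input RDD.
val nested = rdds.reduce(_ union _)

// Recommended pattern: one flat UnionRDD over all inputs.
val flat = sc.union(rdds)
println(flat.partitions.length)
{code}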
[jira] [Comment Edited] (SPARK-3604) unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD
[ https://issues.apache.org/jira/browse/SPARK-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142257#comment-14142257 ] Patrick Wendell edited comment on SPARK-3604 at 9/21/14 12:35 AM: -- After looking at the PR - I think the solution here is just to use the built in SparkContext.union utility when unioning large numbers of RDD's. I misread this originally as a bug report, but I think it's just a usage issue. Perhaps we should document this in the normal union() scaladoc. was (Author: pwendell): After looking at the PR - I think the issue here is just to use the built in SparkContext.union utility when unioning large numbers of RDD's. I misread this originally as a bug report, but I think it's just a usage issue. Perhaps we should document this in the normal union() scaladoc. unbounded recursion in getNumPartitions triggers stack overflow for large UnionRDD -- Key: SPARK-3604 URL: https://issues.apache.org/jira/browse/SPARK-3604 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: linux. Used python, but error is in Scala land. Reporter: Eric Friedman Priority: Critical I have a large number of parquet files all with the same schema and attempted to make a UnionRDD out of them. When I call getNumPartitions(), I get a stack overflow error that looks like this: Py4JJavaError: An error occurred while calling o3275.partitions. : java.lang.StackOverflowError at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239) at scala.collection.TraversableLike$class.map(TraversableLike.scala:243) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:65) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:65) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3620) Refactor parameter handling code for spark-submit
Dale Richardson created SPARK-3620: -- Summary: Refactor parameter handling code for spark-submit Key: SPARK-3620 URL: https://issues.apache.org/jira/browse/SPARK-3620 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0, 1.0.0 Reporter: Dale Richardson Priority: Minor I'm proposing it's time to refactor the configuration argument handling code in spark-submit. The code has grown organically in a short period of time, handles a pretty complicated logic flow, and is now pretty fragile. Some issues that have been identified: 1. Hand-crafted property file readers that do not support the property file format as specified in http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) 2. ResolveURI is not called on paths read from conf/prop files 3. Inconsistent means of merging/overriding values from different sources (some are overridden by file, others by manually setting a field on the object, some by properties) 4. Argument validation should be done after combining config files, system properties and command line arguments 5. Alternate conf file locations are not handled in the shell scripts 6. Some options can only be passed as command line arguments 7. Defaults for options are hard-coded (and sometimes overridden multiple times) in many places throughout the code, e.g. master = local[*] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
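On point 1, property-file parsing can simply be delegated to java.util.Properties so that the documented format (escapes, ':' separators, continuation lines) is honoured. The sketch below is a hedged illustration of that idea, not the actual spark-submit code; the helper name and the UTF-8 choice are assumptions.
{code}
import java.io.{File, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Properties
import scala.collection.JavaConverters._

// Parse a .properties file via Properties.load(Reader) instead of a
// hand-rolled line splitter.
def loadPropertiesFile(file: File): Map[String, String] = {
  val reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)
  try {
    val props = new Properties()
    props.load(reader)
    props.stringPropertyNames.asScala.map(k => k -> props.getProperty(k)).toMap
  } finally {
    reader.close()
  }
}
{code}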
[jira] [Updated] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dale Richardson updated SPARK-3620: --- Summary: Refactor config option handling code for spark-submit (was: Refactor parameter handling code for spark-submit) Refactor config option handling code for spark-submit - Key: SPARK-3620 URL: https://issues.apache.org/jira/browse/SPARK-3620 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0, 1.1.0 Reporter: Dale Richardson Assignee: Dale Richardson Priority: Minor I'm proposing it's time to refactor the configuration argument handling code in spark-submit. The code has grown organically in a short period of time, handles a pretty complicated logic flow, and is now pretty fragile. Some issues that have been identified: 1. Hand-crafted property file readers that do not support the property file format as specified in http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) 2. ResolveURI is not called on paths read from conf/prop files 3. Inconsistent means of merging/overriding values from different sources (some are overridden by file, others by manually setting a field on the object, some by properties) 4. Argument validation should be done after combining config files, system properties and command line arguments 5. Alternate conf file locations are not handled in the shell scripts 6. Some options can only be passed as command line arguments 7. Defaults for options are hard-coded (and sometimes overridden multiple times) in many places throughout the code, e.g. master = local[*] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3247) Improved support for external data sources
[ https://issues.apache.org/jira/browse/SPARK-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142277#comment-14142277 ] Apache Spark commented on SPARK-3247: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2475 Improved support for external data sources -- Key: SPARK-3247 URL: https://issues.apache.org/jira/browse/SPARK-3247 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3599) Avoid loading and printing properties file content frequently
[ https://issues.apache.org/jira/browse/SPARK-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3599. Resolution: Fixed Assignee: WangTaoTheTonic Avoid loading and printing properties file content frequently - Key: SPARK-3599 URL: https://issues.apache.org/jira/browse/SPARK-3599 Project: Spark Issue Type: Improvement Components: Deploy Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Priority: Minor Attachments: too many verbose.txt When I use -v | --verbose in spark-submit, it prints lots of messages about the contents of the properties file. After checking the code in SparkSubmit.scala and SparkSubmitArguments.scala, I found that the getDefaultSparkProperties method is invoked in three places, and every time we invoke it we load the properties from the properties file and print them again if the -v option is used. We should probably use a value instead of a method for the default properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
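The suggestion amounts to caching the parsed defaults. A minimal sketch of the def-versus-lazy-val difference, with illustrative names rather than the actual SparkSubmitArguments code:
{code}
// With a lazy val the file is loaded (and, under -v, printed) at most once,
// no matter how many call sites read it; a def would redo both every time.
class DefaultsSketch(load: () => Map[String, String], verbose: Boolean) {
  lazy val defaultSparkProperties: Map[String, String] = {
    val props = load()
    if (verbose) props.foreach { case (k, v) => println(s"Default property: $k=$v") }
    props
  }
}
{code}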
[jira] [Commented] (SPARK-1966) Cannot cancel tasks running locally
[ https://issues.apache.org/jira/browse/SPARK-1966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142334#comment-14142334 ] Josh Rosen commented on SPARK-1966: --- Actually, scratch that; it wasn't an issue since local execution of tasks is now disabled by default. But I think this bug still exists. Cannot cancel tasks running locally --- Key: SPARK-1966 URL: https://issues.apache.org/jira/browse/SPARK-1966 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0, 1.1.0 Reporter: Aaron Davidson -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1597) Add a version of reduceByKey that takes the Partitioner as a second argument
[ https://issues.apache.org/jira/browse/SPARK-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14142335#comment-14142335 ] Patrick Wendell commented on SPARK-1597: See relevant comment here: https://github.com/apache/spark/pull/550#issuecomment-41483260 Add a version of reduceByKey that takes the Partitioner as a second argument Key: SPARK-1597 URL: https://issues.apache.org/jira/browse/SPARK-1597 Project: Spark Issue Type: Bug Reporter: Matei Zaharia Assignee: Sandeep Singh Priority: Blocker Most of our shuffle methods can take a Partitioner or a number of partitions as a second argument, but for some reason reduceByKey takes the Partitioner as a *first* argument: http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions. We should deprecate that version and add one where the Partitioner is the second argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
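As a concrete illustration of the inconsistency, a sketch assuming a pair RDD in spark-shell; the first two calls use overloads that already exist in PairRDDFunctions, while the commented call only shows the shape of the overload this issue asks for.
{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val a = pairs.reduceByKey(new HashPartitioner(4), _ + _) // today: Partitioner is the *first* argument
val b = pairs.reduceByKey(_ + _, 4)                      // numPartitions already comes second
// Proposed: pairs.reduceByKey(_ + _, new HashPartitioner(4)) // Partitioner as the second argument
{code}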
[jira] [Created] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
Xuefu Zhang created SPARK-3621: -- Summary: Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access Key: SPARK-3621 URL: https://issues.apache.org/jira/browse/SPARK-3621 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0, 1.0.0 Reporter: Xuefu Zhang In some cases, such as Hive's way of doing a map-side join, it would be beneficial to allow the client program to broadcast RDDs rather than just variables made of these RDDs. Broadcasting a variable made of RDDs requires all the RDD data to be collected to the driver and the variable to be shipped to the cluster after being built. It would perform better if the driver just broadcast the RDDs and the jobs used the corresponding data directly (such as building hashmaps at executors). Tez has a broadcast edge which can ship data from the previous stage to the next stage without requiring driver-side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
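For context, the pattern the issue wants to avoid looks roughly like the sketch below (illustrative data, assuming spark-shell): the small side is pulled back to the driver with collectAsMap and only then broadcast, so all of its data passes through the driver.
{code}
// Map-side join today: collect the small RDD to the driver, broadcast the
// resulting map, then probe it inside mapPartitions on the large RDD.
val small = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val large = sc.parallelize(1 to 1000000).map(i => (i % 4, i))

val smallMap = sc.broadcast(small.collectAsMap())  // driver-side collect, then broadcast

val joined = large.mapPartitions { iter =>
  val lookup = smallMap.value
  iter.flatMap { case (k, v) => lookup.get(k).map(s => (k, (v, s))) }
}
println(joined.count())
{code}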
[jira] [Created] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
Xuefu Zhang created SPARK-3622: -- Summary: Provide a custom transformation that can output multiple RDDs Key: SPARK-3622 URL: https://issues.apache.org/jira/browse/SPARK-3622 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang All existing transformations return at most one RDD, even those which take user-supplied functions, such as mapPartitions(). However, sometimes a user-provided function may need to output multiple RDDs. For instance, a filter function that divides the input RDD into several RDDs. While it's possible to get multiple RDDs by transforming the same RDD multiple times, it may be more efficient to do this concurrently in one shot, especially when the user's existing function already generates different data sets. This is the case in Hive on Spark, where Hive's map function and reduce function can output different data sets to be consumed by subsequent stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
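The workaround mentioned above, written out as a sketch with illustrative data (assuming spark-shell): the same parent RDD is filtered several times, so it must be cached or recomputed once per child rather than split in a single pass.
{code}
// Two children derived from one parent by filtering twice; without cache()
// the parent would be recomputed for each child's action.
val parsed = sc.textFile("events.log").map(_.split("\t")).cache() // path is illustrative
val errors = parsed.filter(fields => fields.nonEmpty && fields(0) == "ERROR")
val others = parsed.filter(fields => fields.isEmpty || fields(0) != "ERROR")
println(s"errors=${errors.count()}, others=${others.count()}")
{code}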
[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2630: --- Target Version/s: 1.2.0 (was: 1.1.0) Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Priority: Critical Attachments: overflow.tiff Given one big file, such as text.4.3G, processed in a single task: sc.textFile("text.4.3.G").coalesce(1).count() In the Spark Web UI, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
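The attachment name hints at the likely mechanism. The arithmetic below is an assumption on my part (a byte counter wrapping at 32 bits), not something stated in the ticket, but it does reproduce the reported order of magnitude.
{code}
// Assumption, not confirmed by the ticket: if a ~4.3 GB byte count is squeezed
// into a 32-bit Int, only the low 32 bits survive, which lands in the
// single-digit-megabyte range shown in the UI.
val fileBytes = 4300629606L              // ~4.3 GB, illustrative size
val truncated = fileBytes.toInt          // keeps only the low 32 bits
println(truncated / (1024.0 * 1024.0))   // ≈ 5.4 (MB instead of GB)
{code}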
[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2630: --- Assignee: Andrew Ash Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Assignee: Andrew Ash Priority: Blocker Attachments: overflow.tiff Given one big file, such as text.4.3G, processed in a single task: sc.textFile("text.4.3.G").coalesce(1).count() In the Spark Web UI, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2630: --- Priority: Blocker (was: Critical) Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Priority: Blocker Attachments: overflow.tiff Given one big file, such as text.4.3G, processed in a single task: sc.textFile("text.4.3.G").coalesce(1).count() In the Spark Web UI, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org