[jira] [Assigned] (SPARK-6765) Turn scalastyle on for test code
[ https://issues.apache.org/jira/browse/SPARK-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reassigned SPARK-6765: -- Assignee: Reynold Xin > Turn scalastyle on for test code > > > Key: SPARK-6765 > URL: https://issues.apache.org/jira/browse/SPARK-6765 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > > We should turn scalastyle on for test code. Test code should be as important > as main code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader
[ https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6762: --- Assignee: Apache Spark > Fix potential resource leaks in CheckPoint CheckpointWriter and > CheckpointReader > > > Key: SPARK-6762 > URL: https://issues.apache.org/jira/browse/SPARK-6762 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: zhichao-li >Assignee: Apache Spark >Priority: Minor > > The close action should be placed within finally block to avoid the potential > resource leaks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader
[ https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6762: --- Assignee: (was: Apache Spark) > Fix potential resource leaks in CheckPoint CheckpointWriter and > CheckpointReader > > > Key: SPARK-6762 > URL: https://issues.apache.org/jira/browse/SPARK-6762 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: zhichao-li >Priority: Minor > > The close action should be placed within finally block to avoid the potential > resource leaks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader
[ https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484875#comment-14484875 ] Apache Spark commented on SPARK-6762: - User 'nareshpr' has created a pull request for this issue: https://github.com/apache/spark/pull/5413 > Fix potential resource leaks in CheckPoint CheckpointWriter and > CheckpointReader > > > Key: SPARK-6762 > URL: https://issues.apache.org/jira/browse/SPARK-6762 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: zhichao-li >Priority: Minor > > The close action should be placed within finally block to avoid the potential > resource leaks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
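The fix SPARK-6762 describes is the standard try/finally close pattern. A minimal sketch, using a stand-in class rather than Spark's actual CheckpointWriter/CheckpointReader (`FakeCheckpointStream` is hypothetical): the resource is closed even when the write throws, which is exactly the leak the issue is about.

```scala
import java.io.{Closeable, IOException}

// Stand-in for the checkpoint stream; not Spark's real class.
class FakeCheckpointStream extends Closeable {
  var closed = false
  def write(data: Array[Byte]): Unit =
    if (data.isEmpty) throw new IOException("empty checkpoint")
  override def close(): Unit = closed = true
}

def writeCheckpoint(stream: FakeCheckpointStream, data: Array[Byte]): Unit =
  try {
    stream.write(data)
  } finally {
    stream.close() // runs on success and on failure alike, avoiding the leak
  }
```

If `close()` were called only after a successful `write`, the failing path would leave the stream open; the finally block removes that path.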
[jira] [Created] (SPARK-6766) StreamingListenerBatchSubmitted isn't sent and StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value
Shixiong Zhu created SPARK-6766: --- Summary: StreamingListenerBatchSubmitted isn't sent and StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value Key: SPARK-6766 URL: https://issues.apache.org/jira/browse/SPARK-6766 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2 Reporter: Shixiong Zhu 1. Currently nothing posts StreamingListenerBatchSubmitted; it should be posted when the JobSet is submitted. 2. Calling JobSet.handleJobStart before posting StreamingListenerBatchStarted leaves StreamingListenerBatchStarted.batchInfo.processingStartTime as None, when it should have been set to the correct value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6766) StreamingListenerBatchSubmitted isn't sent and StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value
[ https://issues.apache.org/jira/browse/SPARK-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484898#comment-14484898 ] Apache Spark commented on SPARK-6766: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5414 > StreamingListenerBatchSubmitted isn't sent and > StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value > --- > > Key: SPARK-6766 > URL: https://issues.apache.org/jira/browse/SPARK-6766 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Shixiong Zhu > > 1. Now there is no place posting StreamingListenerBatchSubmitted. It should > be post when JobSet is submitted. > 2. Call JobSet.handleJobStart before posting StreamingListenerBatchStarted > will set StreamingListenerBatchStarted.batchInfo.processingStartTime to None, > which should have been set to a correct value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6766) StreamingListenerBatchSubmitted isn't sent and StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value
[ https://issues.apache.org/jira/browse/SPARK-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6766: --- Assignee: (was: Apache Spark) > StreamingListenerBatchSubmitted isn't sent and > StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value > --- > > Key: SPARK-6766 > URL: https://issues.apache.org/jira/browse/SPARK-6766 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Shixiong Zhu > > 1. Now there is no place posting StreamingListenerBatchSubmitted. It should > be post when JobSet is submitted. > 2. Call JobSet.handleJobStart before posting StreamingListenerBatchStarted > will set StreamingListenerBatchStarted.batchInfo.processingStartTime to None, > which should have been set to a correct value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6766) StreamingListenerBatchSubmitted isn't sent and StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value
[ https://issues.apache.org/jira/browse/SPARK-6766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6766: --- Assignee: Apache Spark > StreamingListenerBatchSubmitted isn't sent and > StreamingListenerBatchStarted.batchInfo.processingStartTime is a wrong value > --- > > Key: SPARK-6766 > URL: https://issues.apache.org/jira/browse/SPARK-6766 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > 1. Now there is no place posting StreamingListenerBatchSubmitted. It should > be post when JobSet is submitted. > 2. Call JobSet.handleJobStart before posting StreamingListenerBatchStarted > will set StreamingListenerBatchStarted.batchInfo.processingStartTime to None, > which should have been set to a correct value. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
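The two fixes described in SPARK-6766 can be sketched with simplified stand-ins (these are not Spark's real JobScheduler, BatchInfo, or StreamingListenerBus classes, just hypothetical cut-down versions): post a Submitted event at the moment the JobSet is queued, and fill in processingStartTime *before* posting the Started event so listeners never observe None.

```scala
import scala.collection.mutable.ArrayBuffer

// Cut-down stand-ins for the streaming listener machinery.
case class BatchInfo(batchTime: Long, processingStartTime: Option[Long])

sealed trait StreamingEvent
case class BatchSubmitted(info: BatchInfo) extends StreamingEvent
case class BatchStarted(info: BatchInfo) extends StreamingEvent

class FakeListenerBus {
  val events = ArrayBuffer.empty[StreamingEvent]
  def post(e: StreamingEvent): Unit = events += e
}

def submitJobSet(bus: FakeListenerBus, batchTime: Long): BatchInfo = {
  val info = BatchInfo(batchTime, processingStartTime = None)
  bus.post(BatchSubmitted(info)) // fix 1: previously this was never posted
  info
}

def handleJobStart(bus: FakeListenerBus, info: BatchInfo, now: Long): BatchInfo = {
  val started = info.copy(processingStartTime = Some(now)) // fix 2: set first...
  bus.post(BatchStarted(started))                          // ...then post
  started
}
```

The bug's ordering (post first, set processingStartTime afterwards) would hand listeners a BatchStarted whose processingStartTime is still None.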
[jira] [Commented] (SPARK-6762) Fix potential resource leaks in CheckPoint CheckpointWriter and CheckpointReader
[ https://issues.apache.org/jira/browse/SPARK-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484900#comment-14484900 ] Apache Spark commented on SPARK-6762: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/5407 > Fix potential resource leaks in CheckPoint CheckpointWriter and > CheckpointReader > > > Key: SPARK-6762 > URL: https://issues.apache.org/jira/browse/SPARK-6762 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: zhichao-li >Priority: Minor > > The close action should be placed within finally block to avoid the potential > resource leaks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6767) Documentation error in Spark SQL Readme file
Tijo Thomas created SPARK-6767: -- Summary: Documentation error in Spark SQL Readme file Key: SPARK-6767 URL: https://issues.apache.org/jira/browse/SPARK-6767 Project: Spark Issue Type: Bug Components: Documentation, SQL Affects Versions: 1.3.0 Reporter: Tijo Thomas Priority: Trivial There is an error in the Spark SQL documentation file: the sample script for the SQL DSL throws the error below. scala> query.where('key > 30).select(avg('key)).collect() :43: error: value > is not a member of Symbol query.where('key > 30).select(avg('key)).collect() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6767) Documentation error in Spark SQL Readme file
[ https://issues.apache.org/jira/browse/SPARK-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6767: --- Assignee: Apache Spark > Documentation error in Spark SQL Readme file > > > Key: SPARK-6767 > URL: https://issues.apache.org/jira/browse/SPARK-6767 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Tijo Thomas >Assignee: Apache Spark >Priority: Trivial > > Error in Spark SQL Documentation file . The sample script for SQL DSL > throwing below error > scala> query.where('key > 30).select(avg('key)).collect() > :43: error: value > is not a member of Symbol > query.where('key > 30).select(avg('key)).collect() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6767) Documentation error in Spark SQL Readme file
[ https://issues.apache.org/jira/browse/SPARK-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6767: --- Assignee: (was: Apache Spark) > Documentation error in Spark SQL Readme file > > > Key: SPARK-6767 > URL: https://issues.apache.org/jira/browse/SPARK-6767 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Tijo Thomas >Priority: Trivial > > Error in Spark SQL Documentation file . The sample script for SQL DSL > throwing below error > scala> query.where('key > 30).select(avg('key)).collect() > :43: error: value > is not a member of Symbol > query.where('key > 30).select(avg('key)).collect() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6767) Documentation error in Spark SQL Readme file
[ https://issues.apache.org/jira/browse/SPARK-6767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484946#comment-14484946 ] Apache Spark commented on SPARK-6767: - User 'tijoparacka' has created a pull request for this issue: https://github.com/apache/spark/pull/5415 > Documentation error in Spark SQL Readme file > > > Key: SPARK-6767 > URL: https://issues.apache.org/jira/browse/SPARK-6767 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 1.3.0 >Reporter: Tijo Thomas >Priority: Trivial > > Error in Spark SQL Documentation file . The sample script for SQL DSL > throwing below error > scala> query.where('key > 30).select(avg('key)).collect() > :43: error: value > is not a member of Symbol > query.where('key > 30).select(avg('key)).collect() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
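For context on why the README snippet fails: `Symbol` itself defines no `>` method; Spark's expression DSL adds one through an implicit conversion that must be in scope. The stand-in below (not Spark's real API; `FakeDsl` and `Predicate` are hypothetical names) reproduces the mechanism. Without `import FakeDsl._`, the comparison fails to compile with exactly "value > is not a member of Symbol", the error quoted in the report.

```scala
// Hypothetical result type for a column predicate.
case class Predicate(column: String, op: String, value: Int)

object FakeDsl {
  // The implicit conversion that grants Symbol a `>` method.
  implicit class SymbolOps(s: Symbol) {
    def >(v: Int): Predicate = Predicate(s.name, ">", v)
  }
}

import FakeDsl._
// The README writes this as 'key > 30; Symbol("key") is the
// literal-free spelling of the same thing.
val predicate = Symbol("key") > 30
```

So the documented snippet only works in a session where the DSL's implicits have been imported; run bare, it produces the reported compile error.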
[jira] [Created] (SPARK-6768) Do not support "float/double union decimal and decimal(a ,b) union decimal(c, d)"
Zhongshuai Pei created SPARK-6768: - Summary: Do not support "float/double union decimal and decimal(a ,b) union decimal(c, d)" Key: SPARK-6768 URL: https://issues.apache.org/jira/browse/SPARK-6768 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhongshuai Pei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6769) Usage of the ListenerBus in YarnClusterSuite is wrong
Kousuke Saruta created SPARK-6769: - Summary: Usage of the ListenerBus in YarnClusterSuite is wrong Key: SPARK-6769 URL: https://issues.apache.org/jira/browse/SPARK-6769 Project: Spark Issue Type: Bug Components: Spark Core, Tests, YARN Affects Versions: 1.4.0 Reporter: Kousuke Saruta Priority: Minor In YarnClusterSuite, a test case uses `SaveExecutorInfo` to handle ExecutorAddedEvent as follows.
{code}
private class SaveExecutorInfo extends SparkListener {
  val addedExecutorInfos = mutable.Map[String, ExecutorInfo]()

  override def onExecutorAdded(executor: SparkListenerExecutorAdded) {
    addedExecutorInfos(executor.executorId) = executor.executorInfo
  }
}

...

listener = new SaveExecutorInfo
val sc = new SparkContext(new SparkConf()
  .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and $dollarSigns"))
sc.addSparkListener(listener)
val status = new File(args(0))
var result = "failure"
try {
  val data = sc.parallelize(1 to 4, 4).collect().toSet
  assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS))
  data should be (Set(1, 2, 3, 4))
  result = "success"
} finally {
  sc.stop()
  Files.write(result, status, UTF_8)
}
{code}
But this usage is wrong: executors spawn while the SparkContext is initializing, whereas SparkContext#addSparkListener can only be invoked after that initialization, and thus after the executors have already spawned, so SaveExecutorInfo never receives the ExecutorAddedEvent. The following code checks the result of handling ExecutorAddedEvent; for the reason above, the assertion can never pass.
{code}
// verify log urls are present
listener.addedExecutorInfos.values.foreach { info =>
  assert(info.logUrlMap.nonEmpty)
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
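The race SPARK-6769 describes can be sketched with stand-in classes (`FakeContext` and `Listener` below are hypothetical, not the real SparkContext or SparkListener): events fired during construction are lost to any listener registered only after the constructor returns.

```scala
import scala.collection.mutable.ArrayBuffer

case class ExecutorAdded(executorId: String)

trait Listener { def onExecutorAdded(e: ExecutorAdded): Unit }

class FakeContext {
  private val listeners = ArrayBuffer.empty[Listener]
  private def post(e: ExecutorAdded): Unit =
    listeners.foreach(_.onExecutorAdded(e))

  // "Executors spawn" during construction, before any external
  // addListener call can possibly run.
  post(ExecutorAdded("1"))
  post(ExecutorAdded("2"))

  def addListener(l: Listener): Unit = listeners += l
}

class SaveExecutorInfo extends Listener {
  val added = ArrayBuffer.empty[String]
  def onExecutorAdded(e: ExecutorAdded): Unit = added += e.executorId
}
```

With this setup, `new FakeContext` followed by `addListener(new SaveExecutorInfo)` leaves the listener's `added` buffer empty, mirroring why the suite's assertion can never pass. A likely remedy is registering the listener through configuration that is processed during initialization (Spark's `spark.extraListeners` setting exists for this purpose) rather than calling addSparkListener afterwards.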
[jira] [Updated] (SPARK-6768) Do not support "float/double union decimal and decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-6768: -- Description: SQL like the following is not supported:
{code}
select cast(12.2056999 as float) from testData limit 1
union
select cast(12.2041 as decimal(7, 4)) from testData limit 1
{code}
{code}
select cast(12.2056999 as double) from testData limit 1
union
select cast(12.2041 as decimal(7, 4)) from testData limit 1
{code}
{code}
select cast(1241.20 as decimal(6, 2)) from testData limit 1
union
select cast(1.204 as decimal(5, 3)) from testData limit 1
{code}
> Do not support "float/double union decimal and decimal(a ,b) union decimal(c, d)"
> -
>
> Key: SPARK-6768
> URL: https://issues.apache.org/jira/browse/SPARK-6768
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Zhongshuai Pei
>
> SQL like the following is not supported:
> {code}
> select cast(12.2056999 as float) from testData limit 1
> union
> select cast(12.2041 as decimal(7, 4)) from testData limit 1
> {code}
> {code}
> select cast(12.2056999 as double) from testData limit 1
> union
> select cast(12.2041 as decimal(7, 4)) from testData limit 1
> {code}
> {code}
> select cast(1241.20 as decimal(6, 2)) from testData limit 1
> union
> select cast(1.204 as decimal(5, 3)) from testData limit 1
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1499) Workers continuously produce failing executors
[ https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484962#comment-14484962 ] XeonZhao commented on SPARK-1499: - How to solve this problem ?I have encountered this issue. > Workers continuously produce failing executors > -- > > Key: SPARK-1499 > URL: https://issues.apache.org/jira/browse/SPARK-1499 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 0.9.1, 1.0.0 >Reporter: Aaron Davidson > > If a node is in a bad state, such that newly started executors fail on > startup or first use, the Standalone Cluster Worker will happily keep > spawning new ones. A better behavior would be for a Worker to mark itself as > dead if it has had a history of continuously producing erroneous executors, > or else to somehow prevent a driver from re-registering executors from the > same machine repeatedly. > Reported on mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E > Relevant logs: > {noformat} > 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/4 is now FAILED (Command exited with code 53) > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor > app-20140411190649-0008/4 removed: Command exited with code 53 > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 > disconnected, so removing it > 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 > (already removed): Failed to create local directory (bad spark.local.dir?) 
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: > app-20140411190649-0008/27 on > worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 > (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor > ID app-20140411190649-0008/27 on hostPort > ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM > 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/27 is now RUNNING > 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: > Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 > with 32.7 GB RAM > 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default > tbl=wikistats_pd > 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr > cmd=get_table : db=default tbl=wikistats_pd > 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string > projectcode, string pagename, i32 pageviews, i32 bytes} > 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: > columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, > string, int, int] separator=[[B@29a81175] nullstring=\N > lastColumnTakesRest=false > shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] > with ID 27 > show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 > disconnected, so removing it > 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 > (already removed): remote Akka client disassociated > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/27 is now FAILED (Command exited with code 53) > 14/04/11 19:06:56 INFO 
cluster.SparkDeploySchedulerBackend: Executor > app-20140411190649-0008/27 removed: Command exited with code 53 > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: > app-20140411190649-0008/28 on > worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 > (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores > 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor > ID app-20140411190649-0008/28 on hostPort > ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/28 is now RUNNING > tables; > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6769) Usage of the ListenerBus in YarnClusterSuite is wrong
[ https://issues.apache.org/jira/browse/SPARK-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484966#comment-14484966 ] Apache Spark commented on SPARK-6769: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/5417 > Usage of the ListenerBus in YarnClusterSuite is wrong > - > > Key: SPARK-6769 > URL: https://issues.apache.org/jira/browse/SPARK-6769 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests, YARN >Affects Versions: 1.4.0 >Reporter: Kousuke Saruta >Priority: Minor > > In YarnClusterSuite, a test case uses `SaveExecutorInfo` to handle > ExecutorAddedEvent as follows. > {code} > private class SaveExecutorInfo extends SparkListener { > val addedExecutorInfos = mutable.Map[String, ExecutorInfo]() > override def onExecutorAdded(executor: SparkListenerExecutorAdded) { > addedExecutorInfos(executor.executorId) = executor.executorInfo > } > } > ... > listener = new SaveExecutorInfo > val sc = new SparkContext(new SparkConf() > .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and > $dollarSigns")) > sc.addSparkListener(listener) > val status = new File(args(0)) > var result = "failure" > try { > val data = sc.parallelize(1 to 4, 4).collect().toSet > assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)) > data should be (Set(1, 2, 3, 4)) > result = "success" > } finally { > sc.stop() > Files.write(result, status, UTF_8) > } > {code} > But, the usage is wrong because Executors will spawn during initializing > SparkContext and SparkContext#addSparkListener should be invoked after the > initialization, thus after Executors spawn, so SaveExecutorInfo cannot handle > ExecutorAddedEvent. > Following code refers the result of the handling ExecutorAddedEvent. Because > of the reason above, we cannot reach the assertion. 
> {code} > // verify log urls are present > listener.addedExecutorInfos.values.foreach { info => > assert(info.logUrlMap.nonEmpty) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
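A common way to avoid this race is to register the listener through the `spark.extraListeners` configuration (available since Spark 1.3), so that it is installed while the SparkContext is still initializing and therefore observes the early ExecutorAddedEvent. The Scala sketch below illustrates only the technique; it is not the actual patch in the pull request above:

```scala
import scala.collection.mutable

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}
import org.apache.spark.scheduler.cluster.ExecutorInfo

// The listener class needs a zero-argument constructor so that Spark
// can instantiate it from the configuration value.
class SaveExecutorInfo extends SparkListener {
  val addedExecutorInfos = mutable.Map[String, ExecutorInfo]()
  override def onExecutorAdded(executor: SparkListenerExecutorAdded): Unit = {
    addedExecutorInfos(executor.executorId) = executor.executorInfo
  }
}

val conf = new SparkConf()
  .setAppName("yarn test app")
  // Registered during SparkContext construction, before executors spawn,
  // unlike sc.addSparkListener, which can only run after the constructor returns.
  .set("spark.extraListeners", classOf[SaveExecutorInfo].getName)
val sc = new SparkContext(conf)
```

The test would then retrieve the registered instance (for example through a companion object set from the constructor, an assumption here) instead of the `listener` variable assigned after SparkContext creation.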
[jira] [Assigned] (SPARK-6769) Usage of the ListenerBus in YarnClusterSuite is wrong
[ https://issues.apache.org/jira/browse/SPARK-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6769: --- Assignee: (was: Apache Spark) > Usage of the ListenerBus in YarnClusterSuite is wrong > - > > Key: SPARK-6769 > URL: https://issues.apache.org/jira/browse/SPARK-6769 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests, YARN >Affects Versions: 1.4.0 >Reporter: Kousuke Saruta >Priority: Minor > > In YarnClusterSuite, a test case uses `SaveExecutorInfo` to handle > ExecutorAddedEvent as follows. > {code} > private class SaveExecutorInfo extends SparkListener { > val addedExecutorInfos = mutable.Map[String, ExecutorInfo]() > override def onExecutorAdded(executor: SparkListenerExecutorAdded) { > addedExecutorInfos(executor.executorId) = executor.executorInfo > } > } > ... > listener = new SaveExecutorInfo > val sc = new SparkContext(new SparkConf() > .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and > $dollarSigns")) > sc.addSparkListener(listener) > val status = new File(args(0)) > var result = "failure" > try { > val data = sc.parallelize(1 to 4, 4).collect().toSet > assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)) > data should be (Set(1, 2, 3, 4)) > result = "success" > } finally { > sc.stop() > Files.write(result, status, UTF_8) > } > {code} > But the usage is wrong, because Executors are spawned while the SparkContext is > being initialized, and SparkContext#addSparkListener can only be invoked after the > initialization, that is, after the Executors have already spawned, so > SaveExecutorInfo cannot handle ExecutorAddedEvent. > The following code refers to the result of handling ExecutorAddedEvent. Because > of the reason above, we can never reach the assertion. 
> {code} > // verify log urls are present > listener.addedExecutorInfos.values.foreach { info => > assert(info.logUrlMap.nonEmpty) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1499) Workers continuously produce failing executors
[ https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484963#comment-14484963 ] XeonZhao commented on SPARK-1499: - How can this problem be solved? I have encountered this issue as well. > Workers continuously produce failing executors > -- > > Key: SPARK-1499 > URL: https://issues.apache.org/jira/browse/SPARK-1499 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 0.9.1, 1.0.0 >Reporter: Aaron Davidson > > If a node is in a bad state, such that newly started executors fail on > startup or first use, the Standalone Cluster Worker will happily keep > spawning new ones. A better behavior would be for a Worker to mark itself as > dead if it has had a history of continuously producing erroneous executors, > or else to somehow prevent a driver from re-registering executors from the > same machine repeatedly. > Reported on mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E > Relevant logs: > {noformat} > 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/4 is now FAILED (Command exited with code 53) > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor > app-20140411190649-0008/4 removed: Command exited with code 53 > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 > disconnected, so removing it > 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 > (already removed): Failed to create local directory (bad spark.local.dir?) 
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: > app-20140411190649-0008/27 on > worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 > (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor > ID app-20140411190649-0008/27 on hostPort > ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM > 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/27 is now RUNNING > 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: > Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 > with 32.7 GB RAM > 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default > tbl=wikistats_pd > 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr > cmd=get_table : db=default tbl=wikistats_pd > 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string > projectcode, string pagename, i32 pageviews, i32 bytes} > 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: > columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, > string, int, int] separator=[[B@29a81175] nullstring=\N > lastColumnTakesRest=false > shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] > with ID 27 > show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 > disconnected, so removing it > 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 > (already removed): remote Akka client disassociated > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/27 is now FAILED (Command exited with code 53) > 14/04/11 19:06:56 INFO 
cluster.SparkDeploySchedulerBackend: Executor > app-20140411190649-0008/27 removed: Command exited with code 53 > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: > app-20140411190649-0008/28 on > worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 > (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores > 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor > ID app-20140411190649-0008/28 on hostPort > ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/28 is now RUNNING > tables; > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
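The mitigation described in the issue (a Worker marking itself dead after a run of failing executors) can be sketched as a small failure tracker. This is an illustrative Scala sketch under assumed names such as `maxConsecutiveFailures`; it is not Spark's actual Worker code:

```scala
// Tracks consecutive executor failures on a Worker; after a configurable
// streak the Worker stops accepting launch requests instead of spawning
// replacements forever.
class ExecutorFailureTracker(maxConsecutiveFailures: Int = 10) {
  private var consecutiveFailures = 0
  @volatile var markedDead = false

  def onExecutorExit(exitCode: Int): Unit = synchronized {
    if (exitCode == 0) {
      consecutiveFailures = 0        // a clean exit resets the streak
    } else {
      consecutiveFailures += 1       // e.g. "Command exited with code 53" above
      if (consecutiveFailures >= maxConsecutiveFailures) {
        markedDead = true            // let the Master schedule executors elsewhere
      }
    }
  }
}
```

Resetting the counter on a clean exit is the important design choice: it distinguishes a persistently bad node (such as the bad spark.local.dir in the log) from occasional failures on a healthy one.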
[jira] [Updated] (SPARK-6768) Do not support "float/double union decimal and decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-6768: -- Description: Do not support sql like that : ``` select cast(12.2056999 as float) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 ``` ``` select cast(12.2056999 as double) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 ``` ``` select cast(1241.20 as decimal(6, 2)) from testData limit 1 union select cast(1.204 as decimal(5, 3)) from testData limit 1 ``` was: Do not support sql like that : ... select cast(12.2056999 as float) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 ... ``` select cast(12.2056999 as double) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 ``` ``` select cast(1241.20 as decimal(6, 2)) from testData limit 1 union select cast(1.204 as decimal(5, 3)) from testData limit 1 ``` > Do not support "float/double union decimal and decimal(a ,b) union decimal(c, > d)" > - > > Key: SPARK-6768 > URL: https://issues.apache.org/jira/browse/SPARK-6768 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhongshuai Pei > > Do not support sql like that : > ``` > select cast(12.2056999 as float) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > ``` > ``` > select cast(12.2056999 as double) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > ``` > ``` > select cast(1241.20 as decimal(6, 2)) from testData limit 1 > union > select cast(1.204 as decimal(5, 3)) from testData limit 1 > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6768) Do not support "float/double union decimal or decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-6768: -- Summary: Do not support "float/double union decimal or decimal(a ,b) union decimal(c, d)" (was: Do not support "float/double union decimal and decimal(a ,b) union decimal(c, d)") > Do not support "float/double union decimal or decimal(a ,b) union decimal(c, > d)" > > > Key: SPARK-6768 > URL: https://issues.apache.org/jira/browse/SPARK-6768 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhongshuai Pei > > Do not support sql like that : > ... > select cast(12.2056999 as float) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > ... > ``` > select cast(12.2056999 as double) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > ``` > ``` > select cast(1241.20 as decimal(6, 2)) from testData limit 1 > union > select cast(1.204 as decimal(5, 3)) from testData limit 1 > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6768) Do not support "float/double union decimal or decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhongshuai Pei updated SPARK-6768: -- Description: Do not support sql like that : select cast(12.2056999 as float) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 select cast(12.2056999 as double) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 select cast(1241.20 as decimal(6, 2)) from testData limit 1 union select cast(1.204 as decimal(5, 3)) from testData limit 1 was: Do not support sql like that : ... select cast(12.2056999 as float) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 ... ``` select cast(12.2056999 as double) from testData limit 1 union select cast(12.2041 as decimal(7, 4)) from testData limit 1 ``` ``` select cast(1241.20 as decimal(6, 2)) from testData limit 1 union select cast(1.204 as decimal(5, 3)) from testData limit 1 ``` > Do not support "float/double union decimal or decimal(a ,b) union decimal(c, > d)" > > > Key: SPARK-6768 > URL: https://issues.apache.org/jira/browse/SPARK-6768 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhongshuai Pei > > Do not support sql like that : > select cast(12.2056999 as float) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(12.2056999 as double) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(1241.20 as decimal(6, 2)) from testData limit 1 > union > select cast(1.204 as decimal(5, 3)) from testData limit 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
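The usual resolution for this class of bug is to widen both UNION branches to a common type: float or double unioned with a decimal widens to double, while two decimals widen to a decimal whose integer digits and scale each cover both operands. A Scala sketch of that widening rule follows; the function name `widerDecimal` is illustrative and is not Spark's analyzer code:

```scala
// Given decimal(p1, s1) and decimal(p2, s2), compute a decimal(p, s) that can
// represent every value of both input types.
def widerDecimal(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(s1, s2)                 // keep the finer fractional part
  val intDigits = math.max(p1 - s1, p2 - s2)   // keep the larger integer part
  (intDigits + scale, scale)
}

// Example from the report: decimal(6, 2) UNION decimal(5, 3)
// widerDecimal(6, 2, 5, 3) == (7, 3), i.e. decimal(7, 3)
```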
[jira] [Assigned] (SPARK-6769) Usage of the ListenerBus in YarnClusterSuite is wrong
[ https://issues.apache.org/jira/browse/SPARK-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6769: --- Assignee: Apache Spark > Usage of the ListenerBus in YarnClusterSuite is wrong > - > > Key: SPARK-6769 > URL: https://issues.apache.org/jira/browse/SPARK-6769 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests, YARN >Affects Versions: 1.4.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > In YarnClusterSuite, a test case uses `SaveExecutorInfo` to handle > ExecutorAddedEvent as follows. > {code} > private class SaveExecutorInfo extends SparkListener { > val addedExecutorInfos = mutable.Map[String, ExecutorInfo]() > override def onExecutorAdded(executor: SparkListenerExecutorAdded) { > addedExecutorInfos(executor.executorId) = executor.executorInfo > } > } > ... > listener = new SaveExecutorInfo > val sc = new SparkContext(new SparkConf() > .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and > $dollarSigns")) > sc.addSparkListener(listener) > val status = new File(args(0)) > var result = "failure" > try { > val data = sc.parallelize(1 to 4, 4).collect().toSet > assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)) > data should be (Set(1, 2, 3, 4)) > result = "success" > } finally { > sc.stop() > Files.write(result, status, UTF_8) > } > {code} > But the usage is wrong, because Executors are spawned while the SparkContext is > being initialized, and SparkContext#addSparkListener can only be invoked after the > initialization, that is, after the Executors have already spawned, so > SaveExecutorInfo cannot handle ExecutorAddedEvent. > The following code refers to the result of handling ExecutorAddedEvent. Because > of the reason above, we can never reach the assertion. 
> {code} > // verify log urls are present > listener.addedExecutorInfos.values.foreach { info => > assert(info.logUrlMap.nonEmpty) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6768) Do not support "float/double union decimal or decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6768: --- Assignee: (was: Apache Spark) > Do not support "float/double union decimal or decimal(a ,b) union decimal(c, > d)" > > > Key: SPARK-6768 > URL: https://issues.apache.org/jira/browse/SPARK-6768 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhongshuai Pei > > Do not support sql like that : > select cast(12.2056999 as float) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(12.2056999 as double) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(1241.20 as decimal(6, 2)) from testData limit 1 > union > select cast(1.204 as decimal(5, 3)) from testData limit 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6768) Do not support "float/double union decimal or decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484967#comment-14484967 ] Apache Spark commented on SPARK-6768: - User 'DoingDone9' has created a pull request for this issue: https://github.com/apache/spark/pull/5418 > Do not support "float/double union decimal or decimal(a ,b) union decimal(c, > d)" > > > Key: SPARK-6768 > URL: https://issues.apache.org/jira/browse/SPARK-6768 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhongshuai Pei > > Do not support sql like that : > select cast(12.2056999 as float) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(12.2056999 as double) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(1241.20 as decimal(6, 2)) from testData limit 1 > union > select cast(1.204 as decimal(5, 3)) from testData limit 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6768) Do not support "float/double union decimal or decimal(a ,b) union decimal(c, d)"
[ https://issues.apache.org/jira/browse/SPARK-6768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6768: --- Assignee: Apache Spark > Do not support "float/double union decimal or decimal(a ,b) union decimal(c, > d)" > > > Key: SPARK-6768 > URL: https://issues.apache.org/jira/browse/SPARK-6768 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Zhongshuai Pei >Assignee: Apache Spark > > Do not support sql like that : > select cast(12.2056999 as float) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(12.2056999 as double) from testData limit 1 > union > select cast(12.2041 as decimal(7, 4)) from testData limit 1 > select cast(1241.20 as decimal(6, 2)) from testData limit 1 > union > select cast(1.204 as decimal(5, 3)) from testData limit 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6769) Usage of the ListenerBus in YarnClusterSuite is wrong
[ https://issues.apache.org/jira/browse/SPARK-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-6769: -- Component/s: (was: Spark Core) > Usage of the ListenerBus in YarnClusterSuite is wrong > - > > Key: SPARK-6769 > URL: https://issues.apache.org/jira/browse/SPARK-6769 > Project: Spark > Issue Type: Bug > Components: Tests, YARN >Affects Versions: 1.4.0 >Reporter: Kousuke Saruta >Priority: Minor > > In YarnClusterSuite, a test case uses `SaveExecutorInfo` to handle > ExecutorAddedEvent as follows. > {code} > private class SaveExecutorInfo extends SparkListener { > val addedExecutorInfos = mutable.Map[String, ExecutorInfo]() > override def onExecutorAdded(executor: SparkListenerExecutorAdded) { > addedExecutorInfos(executor.executorId) = executor.executorInfo > } > } > ... > listener = new SaveExecutorInfo > val sc = new SparkContext(new SparkConf() > .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and > $dollarSigns")) > sc.addSparkListener(listener) > val status = new File(args(0)) > var result = "failure" > try { > val data = sc.parallelize(1 to 4, 4).collect().toSet > assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS)) > data should be (Set(1, 2, 3, 4)) > result = "success" > } finally { > sc.stop() > Files.write(result, status, UTF_8) > } > {code} > But the usage is wrong, because Executors are spawned while the SparkContext is > being initialized, and SparkContext#addSparkListener can only be invoked after the > initialization, that is, after the Executors have already spawned, so > SaveExecutorInfo cannot handle ExecutorAddedEvent. > The following code refers to the result of handling ExecutorAddedEvent. Because > of the reason above, we can never reach the assertion. 
> {code} > // verify log urls are present > listener.addedExecutorInfos.values.foreach { info => > assert(info.logUrlMap.nonEmpty) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6733) Suppression of usage of Scala existential code should be done
[ https://issues.apache.org/jira/browse/SPARK-6733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6733: - Assignee: Vinod KC > Suppression of usage of Scala existential code should be done > - > > Key: SPARK-6733 > URL: https://issues.apache.org/jira/browse/SPARK-6733 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.3.0 > Environment: OS: OSX Yosemite > Hardware: Intel Core i7 with 16 GB RAM >Reporter: Raymond Tay >Assignee: Vinod KC >Priority: Trivial > Fix For: 1.4.0 > > > The inclusion of this statement in the file > {code:scala} > import scala.language.existentials > {code} > should have suppressed all warnings regarding the use of scala existential > code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
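For context, `scala.language.existentials` is a file-wide feature import: once present, it acknowledges the existential-types language feature and should silence the corresponding `-feature` warnings everywhere in that file. A minimal illustrative sketch:

```scala
// With -feature enabled, scalac warns when an inferred type involves an
// existential such as Class[_]; this import acknowledges the feature for
// the whole file and should suppress all such warnings.
import scala.language.existentials

val classes: Seq[Class[_]] = Seq(classOf[String], classOf[Int])
// The inferred element type of `named` mentions Class[_], an existential.
val named = classes.map(c => (c.getName, c))
```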
[jira] [Updated] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-6770: --- Description: I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply("") } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Not sure if this is a bug or a feature, but it's not obvious, so wanted to create a JIRA to make sure we document this behavior. Can someone help me to see th
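This exception typically means the DStream graph was built outside the function passed to `StreamingContext.getOrCreate`, so on recovery the restored checkpoint does not correspond to a freshly initialized stream. A hedged Scala sketch of the conventional fix, moving all stream setup into the factory function (`kafkaParams` and `topics` are assumed to be defined as in the report):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Everything that defines the streaming computation, including
// createDirectStream and the foreachRDD output action, must live inside the
// factory function, so that recovery restores a fully initialized graph.
def functionToCreateContext(): StreamingContext = {
  val sc = new SparkContext(new SparkConf())
  val ssc = new StreamingContext(sc, Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset")

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
  stream.foreachRDD { rdd =>
    // ... convert rdd and write it to MySQL as in the original snippet ...
  }
  ssc  // used only on the first run; on restart the checkpointed graph is restored
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext)
ssc.start()
ssc.awaitTermination()
```

In the original snippet only the checkpoint directory is set inside `functionToCreateContext`, while `createDirectStream` and `foreachRDD` run outside it, which is exactly the pattern the Spark Streaming checkpointing documentation warns against.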
[jira] [Created] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
yangping wu created SPARK-6770: -- Summary: DirectKafkaInputDStream has not been initialized when recovery from checkpoint Key: SPARK-6770 URL: https://issues.apache.org/jira/browse/SPARK-6770 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.0 Reporter: yangping wu I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply("") } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark
[jira] [Updated] (SPARK-6770) DirectKafkaInputDStream has not been initialized when recovery from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yangping wu updated SPARK-6770: --- Description: I am reading data from Kafka using the createDirectStream method and saving the received log to MySQL; the code snippet is as follows {code} def functionToCreateContext(): StreamingContext = { val sparkConf = new SparkConf() val sc = new SparkContext(sparkConf) val ssc = new StreamingContext(sc, Seconds(10)) ssc.checkpoint("/tmp/kafka/channel/offset") // set checkpoint directory ssc } val struct = StructType(StructField("log", StringType) ::Nil) // Get StreamingContext from checkpoint data or create a new one val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset", functionToCreateContext) val SDB = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics) val sqlContext = new org.apache.spark.sql.SQLContext(ssc.sparkContext) SDB.foreachRDD(rdd => { val result = rdd.map(item => { println(item) val result = item._2 match { case e: String => Row.apply(e) case _ => Row.apply("") } result }) println(result.count()) val df = sqlContext.createDataFrame(result, struct) df.insertIntoJDBC(url, "test", overwrite = false) }) ssc.start() ssc.awaitTermination() ssc.stop() {code} But when I recover the program from the checkpoint, I encounter an exception: {code} Exception in thread "main" org.apache.spark.SparkException: org.apache.spark.streaming.kafka.DirectKafkaInputDStream@41a80e5a has not been initialized at org.apache.spark.streaming.dstream.DStream.isTimeValid(DStream.scala:266) at org.apache.spark.streaming.dstream.InputDStream.isTimeValid(InputDStream.scala:51) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:38) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at org.apache.spark.streaming.DStreamGraph$$anonfun$1.apply(DStreamGraph.scala:116) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.streaming.DStreamGraph.generateJobs(DStreamGraph.scala:116) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:223) at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$restart$4.apply(JobGenerator.scala:218) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.streaming.scheduler.JobGenerator.restart(JobGenerator.scala:218) at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:89) at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67) at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:512) at logstatstreaming.UserChannelTodb$.main(UserChannelTodb.scala:57) at logstatstreaming.UserChannelTodb.main(UserChannelTodb.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} Not sure if this is a bug or a feature, but it's not obvious, so I wanted to create a JIRA to make sure we document this behavior. Can someone help me to see the
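The report above matches Spark Streaming's documented checkpoint-recovery contract: when a context is recovered via StreamingContext.getOrCreate, only DStreams created inside the factory function are re-initialized from the checkpoint, so a DirectKafkaInputDStream created after getOrCreate is never initialized on restart. A hedged Scala sketch of the recommended structure follows; kafkaParams, topics, and the write-to-MySQL step are placeholders taken from the report, not verified code:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: with getOrCreate, ALL DStream setup must live inside the
// factory, so that on recovery the whole graph is rebuilt from the checkpoint.
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf(), Seconds(10))
  ssc.checkpoint("/tmp/kafka/channel/offset")
  // Create the direct stream and register every output operation HERE,
  // not after getOrCreate (kafkaParams and topics assumed defined elsewhere).
  val stream = KafkaUtils.createDirectStream[String, String,
    StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  stream.foreachRDD { rdd => /* convert rows and write to MySQL as in the report */ }
  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/kafka/channel/offset",
  functionToCreateContext _)
ssc.start()
ssc.awaitTermination()
```

On the first run the factory executes and the full graph is checkpointed; on restart the graph, including the Kafka stream and its foreachRDD output, is restored from the checkpoint instead of being re-created, which avoids the "has not been initialized" error.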
[jira] [Resolved] (SPARK-6699) PySpark Access Denied error in Windows seen only in ver 1.3
[ https://issues.apache.org/jira/browse/SPARK-6699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6699. -- Resolution: Not A Problem This looks strongly like a local permissions issue and/or lack of numpy, but can be reopened if there is more follow-up about the cause of the error. > PySpark Access Denied error in Windows seen only in ver 1.3 > --- > > Key: SPARK-6699 > URL: https://issues.apache.org/jira/browse/SPARK-6699 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0 > Environment: Windows 8.1 x64 > Windows 7 SP1 x64 >Reporter: RoCm > > Downloaded version 1.3 and tried to run pyspark > I hit this error and am unable to proceed (versions 1.2 and 1.1 work fine) > Pasting the error logs below > C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin>pyspark > Running python with > PYTHONPATH=C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\lib\py4j-0.8.2.1-src.zip; > C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python; > Python 2.7.8 (default, Jun 30 2014, 16:03:49) [MSC v.1500 32 bit (Intel)] on > win32 > Type "help", "copyright", "credits" or "license" for more information. 
> No module named numpy > Traceback (most recent call last): File > "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\bin\..\python\pyspark\shell.py", > line 50, in > sc = SparkContext(appName="PySparkShell", pyFiles=add_files) > File > "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py", > line 108, in __init__ > SparkContext._ensure_initialized(self, gateway=gateway) > File > "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\context.py", > line 222, in _ensure_initialized > SparkContext._gateway = gateway or launch_gateway() > File > "C:\Users\roXYZ\.babun\cygwin\home\roXYZ\spark-1.3.0-bin-hadoop2.4\python\pyspark\java_gateway.py", > line 65, in launch_gateway > proc = Popen(command, stdin=PIPE, env=env) > File "C:\Python27\lib\subprocess.py", line 710, in __init__errread, > errwrite) > File "C:\Python27\lib\subprocess.py", line 958, in _execute_child > startupinfo) > WindowsError: [Error 5] Access is denied -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6771) Table alias in Spark SQL
Jacky19820629 created SPARK-6771: Summary: Table alias in Spark SQL Key: SPARK-6771 URL: https://issues.apache.org/jira/browse/SPARK-6771 Project: Spark Issue Type: Question Components: SQL Affects Versions: 1.2.1 Environment: Spark 1.2.1 built with Hive 0.13.1 on YARN 2.5.2 Reporter: Jacky19820629 I had cached several tables in memory. I found that Spark reads the data from Hadoop instead when I run a SQL query with a table alias; is there any configuration that can fix this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6681) JAVA_HOME error with upgrade to Spark 1.3.0
[ https://issues.apache.org/jira/browse/SPARK-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6681. -- Resolution: Cannot Reproduce We can reopen this if there is a follow-up with more detail but looks like a YARN / env issue. > JAVA_HOME error with upgrade to Spark 1.3.0 > --- > > Key: SPARK-6681 > URL: https://issues.apache.org/jira/browse/SPARK-6681 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 > Environment: Client is Mac OS X version 10.10.2, cluster is running > HDP 2.1 stack. >Reporter: Ken Williams > > I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to > 1.3.0, so I changed my `build.sbt` like so: > {code} > -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % > "provided" > +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % > "provided" > {code} > then make an `assembly` jar, and submit it: > {code} > HADOOP_CONF_DIR=/etc/hadoop/conf \ > spark-submit \ > --driver-class-path=/etc/hbase/conf \ > --conf spark.hadoop.validateOutputSpecs=false \ > --conf > spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --deploy-mode=cluster \ > --master=yarn \ > --class=TestObject \ > --num-executors=54 \ > target/scala-2.11/myapp-assembly-1.2.jar > {code} > The job fails to submit, with the following exception in the terminal: > {code} > 15/03/19 10:30:07 INFO yarn.Client: > 15/03/19 10:20:03 INFO yarn.Client: >client token: N/A >diagnostics: Application application_1420225286501_4698 failed 2 times > due to AM > Container for appattempt_1420225286501_4698_02 exited with > exitCode: 127 > due to: Exception from container-launch: > org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at org.apache.hadoop.util.Shell.run(Shell.java:379) > at > 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > {code} > Finally, I go and check the YARN app master’s web interface (since the job is > there, I know it at least made it that far), and the only logs it shows are > these: > {code} > Log Type: stderr > Log Length: 61 > /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory > > Log Type: stdout > Log Length: 0 > {code} > I’m not sure how to interpret that - is {{ {{JAVA_HOME}} }} a literal > (including the brackets) that’s somehow making it into a script? Is this > coming from the worker nodes or the driver? Anything I can do to experiment > & troubleshoot? > I do have {{JAVA_HOME}} set in the hadoop config files on all the nodes of > the cluster: > {code} > % grep JAVA_HOME /etc/hadoop/conf/*.sh > /etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 > /etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 > {code} > Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no > other changes, the job completes fine. > (Note: I originally posted this on the Spark mailing list and also on Stack > Overflow, I'll update both places if/when I find a solution.) 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-1499) Workers continuously produce failing executors
[ https://issues.apache.org/jira/browse/SPARK-1499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XeonZhao updated SPARK-1499: Comment: was deleted (was: How to solve this problem ?I have encountered this issue.) > Workers continuously produce failing executors > -- > > Key: SPARK-1499 > URL: https://issues.apache.org/jira/browse/SPARK-1499 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 0.9.1, 1.0.0 >Reporter: Aaron Davidson > > If a node is in a bad state, such that newly started executors fail on > startup or first use, the Standalone Cluster Worker will happily keep > spawning new ones. A better behavior would be for a Worker to mark itself as > dead if it has had a history of continuously producing erroneous executors, > or else to somehow prevent a driver from re-registering executors from the > same machine repeatedly. > Reported on mailing list: > http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccal8t0bqjfgtf-vbzjq6yj7ckbl_9p9s0trvew2mvg6zbngx...@mail.gmail.com%3E > Relevant logs: > {noformat} > 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/4 is now FAILED (Command exited with code 53) > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor > app-20140411190649-0008/4 removed: Command exited with code 53 > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Executor 4 > disconnected, so removing it > 14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4 > (already removed): Failed to create local directory (bad spark.local.dir?) 
> 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor added: > app-20140411190649-0008/27 on > worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 > (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores > 14/04/11 19:06:52 INFO cluster.SparkDeploySchedulerBackend: Granted executor > ID app-20140411190649-0008/27 on hostPort > ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM > 14/04/11 19:06:52 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/27 is now RUNNING > 14/04/11 19:06:52 INFO storage.BlockManagerMasterActor$BlockManagerInfo: > Registering block manager ip-172-31-24-76.us-west-1.compute.internal:50256 > with 32.7 GB RAM > 14/04/11 19:06:52 INFO metastore.HiveMetaStore: 0: get_table : db=default > tbl=wikistats_pd > 14/04/11 19:06:52 INFO HiveMetaStore.audit: ugi=root ip=unknown-ip-addr > cmd=get_table : db=default tbl=wikistats_pd > 14/04/11 19:06:53 DEBUG hive.log: DDL: struct wikistats_pd { string > projectcode, string pagename, i32 pageviews, i32 bytes} > 14/04/11 19:06:53 DEBUG lazy.LazySimpleSerDe: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe initialized with: > columnNames=[projectcode, pagename, pageviews, bytes] columnTypes=[string, > string, int, int] separator=[[B@29a81175] nullstring=\N > lastColumnTakesRest=false > shark> 14/04/11 19:06:55 INFO cluster.SparkDeploySchedulerBackend: Registered > executor: > Actor[akka.tcp://sparkexecu...@ip-172-31-19-11.us-west-1.compute.internal:45248/user/Executor#-1002203295] > with ID 27 > show 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Executor 27 > disconnected, so removing it > 14/04/11 19:06:56 ERROR scheduler.TaskSchedulerImpl: Lost an executor 27 > (already removed): remote Akka client disassociated > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/27 is now FAILED (Command exited with code 53) > 14/04/11 19:06:56 INFO 
cluster.SparkDeploySchedulerBackend: Executor > app-20140411190649-0008/27 removed: Command exited with code 53 > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor added: > app-20140411190649-0008/28 on > worker-20140409212012-ip-172-31-19-11.us-west-1.compute.internal-58614 > (ip-172-31-19-11.us-west-1.compute.internal:58614) with 8 cores > 14/04/11 19:06:56 INFO cluster.SparkDeploySchedulerBackend: Granted executor > ID app-20140411190649-0008/28 on hostPort > ip-172-31-19-11.us-west-1.compute.internal:58614 with 8 cores, 56.9 GB RAM > 14/04/11 19:06:56 INFO client.AppClient$ClientActor: Executor updated: > app-20140411190649-0008/28 is now RUNNING > tables; > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6772) spark sql error when running code on large number of records
Aditya Parmar created SPARK-6772: Summary: spark sql error when running code on large number of records Key: SPARK-6772 URL: https://issues.apache.org/jira/browse/SPARK-6772 Project: Spark Issue Type: Test Components: SQL Affects Versions: 1.2.0 Reporter: Aditya Parmar Hi all, I am getting an ArrayIndexOutOfBoundsException when I try to run a simple column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine on a file with 2k records. 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 3] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 4] 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at JavaSchemaRDD.scala:42, took 
1.914477 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com): java.lang.ArrayIndexOutOfBoundsException Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [aditya@blrwfl11189 ~]$ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
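One hedged observation on the report above: an ArrayIndexOutOfBoundsException that appears only on the larger input is often caused by a single malformed row whose split yields fewer fields than the query indexes, and a 2k-record sample may simply not contain such a row. A defensive-parsing sketch in Scala; the delimiter, expected column count, and file name are assumptions for illustration, not taken from the reporter's job:

```scala
// Sketch: filter out short rows before indexing split fields, so one bad
// record does not fail the whole stage. Assumes `sc` is an existing
// SparkContext and the input is comma-delimited with at least 4 columns.
val expectedCols = 4
val rows = sc.textFile("input.txt")
  .map(_.split(",", -1))              // -1 keeps trailing empty fields
  .filter(_.length >= expectedCols)   // drop malformed rows instead of crashing
  .map(f => (f(0), f(1), f(2)))       // indexing is now safe
```

Logging or counting the dropped rows (e.g. with an accumulator) would confirm whether malformed input is actually the cause.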
[jira] [Created] (SPARK-6773) check -license will passed in next time when rat jar download failed.
June created SPARK-6773: --- Summary: check -license will passed in next time when rat jar download failed. Key: SPARK-6773 URL: https://issues.apache.org/jira/browse/SPARK-6773 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.3.0 Reporter: June In dev/check-license, the script downloads the Rat jar if it does not exist. If the download fails, it reports an error: ** Attempting to fetch rat Our attempt to download rat locally to /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar failed. Please install rat manually. * But if it is run again in the next cycle, the RAT check passes and the build continues, although an error is reported: ** Error: Invalid or corrupt jarfile /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar RAT checks passed. * This is because: 1. The broken temporary rat.jar is not removed when the download fails, so the next run checks licenses using the corrupt rat.jar. 2. rat-results.txt is empty because rat.jar failed to run, so the RAT check passes. Suggestions: 1. Add a clean-up step when the rat.jar download fails. 2. Add error checking after the RAT run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
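The two suggestions in the report can be sketched in shell; the function and file names below are illustrative, not taken from the actual dev/check-license script:

```shell
# Fix 1: never leave a partial/corrupt rat jar behind when the download
# fails, so the next run re-fetches it instead of trusting a broken file.
fetch_or_clean() {
  local jar="$1" url="$2"
  if ! curl --fail -sSL -o "$jar" "$url"; then
    rm -f "$jar"
    echo "rat download failed; removed partial $jar" >&2
    return 1
  fi
}

# Fix 2: treat an empty rat-results.txt as a failed RAT run, not a pass.
rat_results_ok() {
  local results="$1"
  [ -s "$results" ] || { echo "RAT produced no output" >&2; return 1; }
}
```

Together these make the second run fail fast (re-download, then refuse an empty results file) instead of silently reporting "RAT checks passed".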
[jira] [Updated] (SPARK-6773) check -license will passed in next time when rat jar download failed.
[ https://issues.apache.org/jira/browse/SPARK-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] June updated SPARK-6773: Description: In dev/check-license, it will download Rat jar if it not exist. if download failed, it will report error: ** Attempting to fetch rat Our attempt to download rat locally to /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar failed. Please install rat manually. * but if run it again in next cycle, it will check RAT passed and go on building also an error reported: ** Error: Invalid or corrupt jarfile /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar RAT checks passed. * This is because: 1. The error tmp rat.jar is not removed when rat jar download failed in last time. So it will go on checking license using the error rat.jar 2. The rat-results.txt is empty because rat.jar run failed, so RAT check passed. Suggest: 1. Add a clean step when rat.jar download faild. 2. Add a error checking logic after run rat checking. was: In dev/check-license, it will download Rat jar if it not exist. if download failed, it will report error: ** Attempting to fetch rat Our attempt to download rat locally to /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar failed. Please install rat manually. * but if run it again in next cycle, it will check RAT passed and go on building also an error reported: ** Error: Invalid or corrupt jarfile /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar RAT checks passed. * This is because: 1. The error tmp rat.jar is not removed when rat jar download failed in last time. So it will go on checking license using the error rat.jar 2. the rat-results.txt is empty because rat.jar run failed, so RAT faild. Suggest: 1. Add a clean step when rat.jar download faild. 2. Add a error checking logic after run rat checking. > check -license will passed in next time when rat jar download failed. 
> - > > Key: SPARK-6773 > URL: https://issues.apache.org/jira/browse/SPARK-6773 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.3.0 >Reporter: June > > In dev/check-license, it will download Rat jar if it not exist. if download > failed, it will report error: > ** > Attempting to fetch rat > Our attempt to download rat locally to > /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar failed. Please > install rat manually. > * > but if run it again in next cycle, it will check RAT passed and go on > building also an error reported: > ** > Error: Invalid or corrupt jarfile > /home/spark/hejun/sparkgit/spark/lib/apache-rat-0.10.jar > RAT checks passed. > * > This is because: > 1. The error tmp rat.jar is not removed when rat jar download failed in last > time. So it will go on checking license using the error rat.jar > 2. The rat-results.txt is empty because rat.jar run failed, so RAT check > passed. > Suggest: > 1. Add a clean step when rat.jar download faild. > 2. Add a error checking logic after run rat checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6773) check-license passes on the next run when the rat jar download fails.
[ https://issues.apache.org/jira/browse/SPARK-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6773: --- Assignee: Apache Spark > check-license passes on the next run when the rat jar download fails. > - > > Key: SPARK-6773 > URL: https://issues.apache.org/jira/browse/SPARK-6773 > Project: Spark > Issue Type: Bug > Components: Build > Affects Versions: 1.3.0 > Reporter: June > Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6773) check-license passes on the next run when the rat jar download fails.
[ https://issues.apache.org/jira/browse/SPARK-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485156#comment-14485156 ] Apache Spark commented on SPARK-6773: - User 'sisihj' has created a pull request for this issue: https://github.com/apache/spark/pull/5421 > check-license passes on the next run when the rat jar download fails. > - > > Key: SPARK-6773 > URL: https://issues.apache.org/jira/browse/SPARK-6773 > Project: Spark > Issue Type: Bug > Components: Build > Affects Versions: 1.3.0 > Reporter: June -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6773) check-license passes on the next run when the rat jar download fails.
[ https://issues.apache.org/jira/browse/SPARK-6773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6773: --- Assignee: (was: Apache Spark) > check-license passes on the next run when the rat jar download fails. > - > > Key: SPARK-6773 > URL: https://issues.apache.org/jira/browse/SPARK-6773 > Project: Spark > Issue Type: Bug > Components: Build > Affects Versions: 1.3.0 > Reporter: June -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5960) Allow AWS credentials to be passed to KinesisUtils.createStream()
[ https://issues.apache.org/jira/browse/SPARK-5960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485168#comment-14485168 ] Cihan Emre commented on SPARK-5960: --- [~cfregly] Are you working on this? I might be able to help if it's ok for you. > Allow AWS credentials to be passed to KinesisUtils.createStream() > - > > Key: SPARK-5960 > URL: https://issues.apache.org/jira/browse/SPARK-5960 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.1.0 >Reporter: Chris Fregly >Assignee: Chris Fregly > > While IAM roles are preferable, we're seeing a lot of cases where we need to > pass AWS credentials when creating the KinesisReceiver. > Notes: > * Make sure we don't log the credentials anywhere > * Maintain compatibility with existing KinesisReceiver-based code. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6681) JAVA_HOME error with upgrade to Spark 1.3.0
[ https://issues.apache.org/jira/browse/SPARK-6681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485170#comment-14485170 ] Ken Williams commented on SPARK-6681: - Sounds good - I haven't been able to investigate it further, but I'll put any further info here if/when I find it. > JAVA_HOME error with upgrade to Spark 1.3.0 > --- > > Key: SPARK-6681 > URL: https://issues.apache.org/jira/browse/SPARK-6681 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 > Environment: Client is Mac OS X version 10.10.2, cluster is running > HDP 2.1 stack. >Reporter: Ken Williams > > I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to > 1.3.0, so I changed my `build.sbt` like so: > {code} > -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % > "provided" > +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % > "provided" > {code} > then make an `assembly` jar, and submit it: > {code} > HADOOP_CONF_DIR=/etc/hadoop/conf \ > spark-submit \ > --driver-class-path=/etc/hbase/conf \ > --conf spark.hadoop.validateOutputSpecs=false \ > --conf > spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --deploy-mode=cluster \ > --master=yarn \ > --class=TestObject \ > --num-executors=54 \ > target/scala-2.11/myapp-assembly-1.2.jar > {code} > The job fails to submit, with the following exception in the terminal: > {code} > 15/03/19 10:30:07 INFO yarn.Client: > 15/03/19 10:20:03 INFO yarn.Client: >client token: N/A >diagnostics: Application application_1420225286501_4698 failed 2 times > due to AM > Container for appattempt_1420225286501_4698_02 exited with > exitCode: 127 > due to: Exception from container-launch: > org.apache.hadoop.util.Shell$ExitCodeException: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:464) > at 
org.apache.hadoop.util.Shell.run(Shell.java:379) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > {code} > Finally, I go and check the YARN app master’s web interface (since the job is > there, I know it at least made it that far), and the only logs it shows are > these: > {code} > Log Type: stderr > Log Length: 61 > /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory > > Log Type: stdout > Log Length: 0 > {code} > I’m not sure how to interpret that - is {{ {{JAVA_HOME}} }} a literal > (including the brackets) that’s somehow making it into a script? Is this > coming from the worker nodes or the driver? Anything I can do to experiment > & troubleshoot? > I do have {{JAVA_HOME}} set in the hadoop config files on all the nodes of > the cluster: > {code} > % grep JAVA_HOME /etc/hadoop/conf/*.sh > /etc/hadoop/conf/hadoop-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 > /etc/hadoop/conf/yarn-env.sh:export JAVA_HOME=/usr/jdk64/jdk1.6.0_31 > {code} > Has this behavior changed in 1.3.0 since 1.2.1? Using 1.2.1 and making no > other changes, the job completes fine. > (Note: I originally posted this on the Spark mailing list and also on Stack > Overflow, I'll update both places if/when I find a solution.) 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
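For the {{JAVA_HOME}} report above, one avenue worth testing (a workaround sketch under assumptions, not a confirmed fix for this issue) is to pin JAVA_HOME explicitly for the YARN containers instead of relying on the cluster-side hadoop-env.sh/yarn-env.sh expansion; Spark on YARN supports per-container environment settings via `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*`. The JDK path shown is copied from the cluster config quoted above; the rest of the invocation mirrors the reporter's command.

```shell
# Workaround sketch: set JAVA_HOME explicitly for the YARN application
# master and the executors, bypassing the unexpanded placeholder.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/jdk64/jdk1.6.0_31 \
  --conf spark.executorEnv.JAVA_HOME=/usr/jdk64/jdk1.6.0_31 \
  --class TestObject \
  target/scala-2.11/myapp-assembly-1.2.jar
```

If the literal placeholder still shows up in the container launch script after this, that would point at NodeManager-side configuration rather than the submission command.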
[jira] [Created] (SPARK-6774) Implement Parquet complex types backwards-compatibility rules
Cheng Lian created SPARK-6774: - Summary: Implement Parquet complex types backwards-compatibility rules Key: SPARK-6774 URL: https://issues.apache.org/jira/browse/SPARK-6774 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Before [Parquet format PR #17|https://github.com/apache/incubator-parquet-format/pull/17], representations of Parquet complex types were not standardized. Different systems just followed their own rules. This left lots of legacy Parquet files -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6774) Implement Parquet complex types backwards-compatibility rules
[ https://issues.apache.org/jira/browse/SPARK-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6774: -- Description: [Parquet format PR #17|https://github.com/apache/incubator-parquet-format/pull/17] standardized the representation of Parquet complex types and listed backwards-compatibility rules. Spark SQL should implement these compatibility rules to improve interoperability. (was: Before [Parquet format PR #17|https://github.com/apache/incubator-parquet-format/pull/17], representations of Parquet complex types were not standardized. Different systems just followed their own rules. This left lots of legacy Parquet files ) > Implement Parquet complex types backwards-compatibility rules > - > > Key: SPARK-6774 > URL: https://issues.apache.org/jira/browse/SPARK-6774 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 1.0.0, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Cheng Lian > > [Parquet format PR > #17|https://github.com/apache/incubator-parquet-format/pull/17] standardized the representation of Parquet complex types and listed backwards-compatibility rules. Spark SQL should implement these compatibility rules to improve interoperability. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6775) Simplify CatalystConverter class hierarchy and pass in Parquet schema
Cheng Lian created SPARK-6775: - Summary: Simplify CatalystConverter class hierarchy and pass in Parquet schema Key: SPARK-6775 URL: https://issues.apache.org/jira/browse/SPARK-6775 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian {{CatalystConverter}} classes are used to convert Parquet records to Spark SQL row objects. Current converter implementations have the following problems: # They simply ignore the original Parquet schema, which makes adding Parquet backwards-compatibility rules impossible. # They are unnecessarily overcomplicated. # {{SpecificMutableRow}} is only used for structs whose fields are all of primitive types. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6776) Implement backwards-compatibility rules in CatalystConverters
Cheng Lian created SPARK-6776: - Summary: Implement backwards-compatibility rules in CatalystConverters Key: SPARK-6776 URL: https://issues.apache.org/jira/browse/SPARK-6776 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6776) Implement backwards-compatibility rules in CatalystConverters
[ https://issues.apache.org/jira/browse/SPARK-6776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-6776: -- Description: Spark SQL should also be able to read Parquet complex types represented in several commonly used non-standard ways, for example, legacy files written by parquet-avro, parquet-thrift, and parquet-hive. We may just follow the pattern used in {{AvroIndexedRecordConverter}}. > Implement backwards-compatibility rules in CatalystConverters > - > > Key: SPARK-6776 > URL: https://issues.apache.org/jira/browse/SPARK-6776 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Cheng Lian > > Spark SQL should also be able to read Parquet complex types represented in several commonly used non-standard ways, for example, legacy files written by parquet-avro, parquet-thrift, and parquet-hive. We may just follow the pattern used in {{AvroIndexedRecordConverter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6777) Implement backwards-compatibility rules in Parquet schema converters
Cheng Lian created SPARK-6777: - Summary: Implement backwards-compatibility rules in Parquet schema converters Key: SPARK-6777 URL: https://issues.apache.org/jira/browse/SPARK-6777 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.3.0, 1.2.1, 1.1.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian When converting a Parquet schema to a Spark SQL schema, we should recognize commonly used legacy non-standard representations of complex types. We can follow the pattern used in Parquet's {{AvroSchemaConverter}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6775) Simplify CatalystConverter class hierarchy and pass in Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6775: --- Assignee: Cheng Lian (was: Apache Spark) > Simplify CatalystConverter class hierarchy and pass in Parquet schema > - > > Key: SPARK-6775 > URL: https://issues.apache.org/jira/browse/SPARK-6775 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6776) Implement backwards-compatibility rules in CatalystConverters
[ https://issues.apache.org/jira/browse/SPARK-6776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485221#comment-14485221 ] Apache Spark commented on SPARK-6776: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5422 > Implement backwards-compatibility rules in CatalystConverters > - > > Key: SPARK-6776 > URL: https://issues.apache.org/jira/browse/SPARK-6776 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6775) Simplify CatalystConverter class hierarchy and pass in Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6775: --- Assignee: Apache Spark (was: Cheng Lian) > Simplify CatalystConverter class hierarchy and pass in Parquet schema > - > > Key: SPARK-6775 > URL: https://issues.apache.org/jira/browse/SPARK-6775 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6776) Implement backwards-compatibility rules in CatalystConverters
[ https://issues.apache.org/jira/browse/SPARK-6776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6776: --- Assignee: Apache Spark (was: Cheng Lian) > Implement backwards-compatibility rules in CatalystConverters > - > > Key: SPARK-6776 > URL: https://issues.apache.org/jira/browse/SPARK-6776 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6776) Implement backwards-compatibility rules in CatalystConverters
[ https://issues.apache.org/jira/browse/SPARK-6776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6776: --- Assignee: Cheng Lian (was: Apache Spark) > Implement backwards-compatibility rules in CatalystConverters > - > > Key: SPARK-6776 > URL: https://issues.apache.org/jira/browse/SPARK-6776 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6775) Simplify CatalystConverter class hierarchy and pass in Parquet schema
[ https://issues.apache.org/jira/browse/SPARK-6775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485219#comment-14485219 ] Apache Spark commented on SPARK-6775: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5422 > Simplify CatalystConverter class hierarchy and pass in Parquet schema > - > > Key: SPARK-6775 > URL: https://issues.apache.org/jira/browse/SPARK-6775 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 > Reporter: Cheng Lian > Assignee: Cheng Lian -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1537: --- Assignee: Apache Spark (was: Marcelo Vanzin) > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Vanzin >Assignee: Apache Spark > Attachments: SPARK-1537.txt, spark-1573.patch > > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485258#comment-14485258 ] Apache Spark commented on SPARK-1537: - User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/5423 > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN > Reporter: Marcelo Vanzin > Assignee: Marcelo Vanzin > Attachments: SPARK-1537.txt, spark-1573.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-1537: --- Assignee: Marcelo Vanzin (was: Apache Spark) > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN > Reporter: Marcelo Vanzin > Assignee: Marcelo Vanzin > Attachments: SPARK-1537.txt, spark-1573.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5114) Should Evaluator be a PipelineStage
[ https://issues.apache.org/jira/browse/SPARK-5114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483335#comment-14483335 ] Peter Rudenko edited comment on SPARK-5114 at 4/8/15 2:14 PM: -- +1 for "should". For my use case (creating a pipeline from a config file), sometimes there is a need to do evaluation with several custom metrics (e.g. Gini norm, etc.), and sometimes there is no need because evaluation is done in another part of the system. It would be more flexible if the evaluator could be part of the pipeline. was (Author: prudenko): +1 for "should". For my use case (creating a pipeline from a config file), sometimes there is a need to do evaluation with custom metrics (e.g. Gini norm, etc.), and sometimes there is no need because evaluation is done in another part of the system. It would be more flexible if the evaluator could be part of the pipeline. > Should Evaluator be a PipelineStage > --- > > Key: SPARK-5114 > URL: https://issues.apache.org/jira/browse/SPARK-5114 > Project: Spark > Issue Type: Question > Components: ML > Affects Versions: 1.2.0 > Reporter: Joseph K. Bradley > > Pipelines can currently contain Estimators and Transformers. > Question for debate: Should Pipelines be able to contain Evaluators? > Pros: > * Evaluators take input datasets with particular schemas, which should perhaps be checked before running a Pipeline. > Cons: > * Evaluators do not transform datasets. They produce a scalar (or a few values), which makes it hard to say how they fit into a Pipeline or a PipelineModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6778) SQL contexts in spark-shell and pyspark should both be called sqlContext
Matei Zaharia created SPARK-6778: Summary: SQL contexts in spark-shell and pyspark should both be called sqlContext Key: SPARK-6778 URL: https://issues.apache.org/jira/browse/SPARK-6778 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Reporter: Matei Zaharia For some reason the Python one is only called sqlCtx. This is pretty confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5957) Better handling of default parameter values.
[ https://issues.apache.org/jira/browse/SPARK-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-5957: Assignee: Xiangrui Meng > Better handling of default parameter values. > > > Key: SPARK-5957 > URL: https://issues.apache.org/jira/browse/SPARK-5957 > Project: Spark > Issue Type: Sub-task > Components: ML > Affects Versions: 1.3.0 > Reporter: Xiangrui Meng > Assignee: Xiangrui Meng > > We store the default value of a parameter in the Param instance. In many cases, the default value depends on the algorithm and other parameters defined in the same algorithm. We need to think of a better approach to handling default parameter values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5874) How to improve the current ML pipeline API?
[ https://issues.apache.org/jira/browse/SPARK-5874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5874: - Description: I created this JIRA to collect feedback about the ML pipeline API we introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 with confidence, which requires valuable input from the community. I'll create sub-tasks for each major issue. Design doc (WIP): https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# was: I created this JIRA to collect feedback about the ML pipeline API we introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 with confidence, which requires valuable input from the community. I'll create sub-tasks for each major issue. > How to improve the current ML pipeline API? > --- > > Key: SPARK-5874 > URL: https://issues.apache.org/jira/browse/SPARK-5874 > Project: Spark > Issue Type: Brainstorming > Components: ML > Reporter: Xiangrui Meng > Assignee: Xiangrui Meng > Priority: Critical > > I created this JIRA to collect feedback about the ML pipeline API we introduced in Spark 1.2. The target is to graduate this set of APIs in 1.4 with confidence, which requires valuable input from the community. I'll create sub-tasks for each major issue. > Design doc (WIP): > https://docs.google.com/a/databricks.com/document/d/1plFBPJY_PriPTuMiFYLSm7fQgD1FieP4wt3oMVKMGcc/edit# -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5957) Better handling of default parameter values.
[ https://issues.apache.org/jira/browse/SPARK-5957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5957: - Description: We store the default value of a parameter in the Param instance. In many cases, the default value depends on the algorithm and other parameters defined in the same algorithm. We need to think of a better approach to handle default parameter values. The design doc was posted in the parent JIRA: https://issues.apache.org/jira/browse/SPARK-5874 was:We store the default value of a parameter in the Param instance. In many cases, the default value depends on the algorithm and other parameters defined in the same algorithm. We need to think of a better approach to handle default parameter values. > Better handling of default parameter values. > > > Key: SPARK-5957 > URL: https://issues.apache.org/jira/browse/SPARK-5957 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We store the default value of a parameter in the Param instance. In many > cases, the default value depends on the algorithm and other parameters > defined in the same algorithm. We need to think of a better approach to handle > default parameter values. > The design doc was posted in the parent JIRA: > https://issues.apache.org/jira/browse/SPARK-5874 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
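The direction the issue describes can be sketched in a few lines (a sketch only; the `Param`/`get_or_default` names and the lazy-callable-default idea below are illustrative assumptions, not the actual spark.ml API): keep explicitly set values separate from defaults, and let a default be a function of the other parameters of the same algorithm.

```python
# Sketch (assumed names, not the real spark.ml API): resolve defaults lazily
# so a default can depend on the algorithm and on its other parameters.
class Param:
    def __init__(self, name, default=None):
        self.name = name
        self.default = default  # a plain value, or a callable taking the Params holder

class Params:
    def __init__(self):
        self._values = {}  # only explicitly set values live here

    def set(self, param, value):
        self._values[param.name] = value
        return self

    def get_or_default(self, param):
        if param.name in self._values:
            return self._values[param.name]
        if callable(param.default):
            # a callable default can inspect the other parameters
            return param.default(self)
        return param.default

class LogisticRegression(Params):
    max_iter = Param("maxIter", default=100)
    # illustrative: the regularization default depends on another parameter
    reg_param = Param(
        "regParam",
        default=lambda p: 0.0 if p.get_or_default(LogisticRegression.max_iter) < 50 else 0.01,
    )

lr = LogisticRegression()
print(lr.get_or_default(LogisticRegression.reg_param))  # 0.01 (maxIter defaults to 100)
lr.set(LogisticRegression.max_iter, 10)
print(lr.get_or_default(LogisticRegression.reg_param))  # 0.0
```

The point of the sketch is that nothing algorithm-specific needs to be baked into `Param` itself; the default is resolved against the holder at read time.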
[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485521#comment-14485521 ] Davies Liu commented on SPARK-6677: --- What is the expected output? I got this:
{code}
key: a
res1 data as row: [Row(foo=1, key=u'a')]
res2 data as row: [Row(bar=3, key=u'a', other=u'foobar')]
res1 and res2 fields: (u'foo', u'key') (u'bar', u'key', u'other')
res1 data as tuple: 1 a
res2 data as tuple: 3 a foobar
key: c
res1 data as row: []
res2 data as row: [Row(bar=4, key=u'c', other=u'barfoo')]
key: b
res1 data as row: [Row(foo=2, key=u'b')]
res2 data as row: []
{code}
> pyspark.sql nondeterministic issue with row fields > -- > > Key: SPARK-6677 > URL: https://issues.apache.org/jira/browse/SPARK-6677 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0 > Environment: spark version: spark-1.3.0-bin-hadoop2.4 > python version: Python 2.7.6 > operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux >Reporter: Stefano Parmesan > Labels: pyspark, row, sql > > The following issue happens only when running pyspark in the python > interpreter; it works correctly with spark-submit. > Reading two json files containing objects with a different structure > sometimes leads to the definition of wrong Rows, where the fields of a file are > used for the other one. > I was able to write a sample code that reproduces this issue about one out of three > times; the code snippet is available at the following link, together with > some (very simple) data samples: > https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6779) Move shared params to param.shared and use code gen
Xiangrui Meng created SPARK-6779: Summary: Move shared params to param.shared and use code gen Key: SPARK-6779 URL: https://issues.apache.org/jira/browse/SPARK-6779 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng The boilerplate code should be automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485554#comment-14485554 ] Alexander Ulanov commented on SPARK-6682: - [~yuu.ishik...@gmail.com] They reside in package org.apache.spark.mllib.optimization: class LBFGS(private var gradient: Gradient, private var updater: Updater) and class GradientDescent private[mllib] (private var gradient: Gradient, private var updater: Updater). They extend the Optimizer trait, which has only one function: def optimize(data: RDD[(Double, Vector)], initialWeights: Vector): Vector. This function is limited to only one type of input: vectors and their labels. I have submitted a separate issue regarding this: https://issues.apache.org/jira/browse/SPARK-5362. 1. Right now the static methods work with hard-coded optimizers, such as LogisticRegressionWithSGD. This is not very convenient. I think moving away from static methods and using builders implies that optimizers could also be set by users. This will be a problem because the current optimizers require an Updater and a Gradient at creation time. 2. The workaround I suggested in the previous post addresses this. > Deprecate static train and use builder instead for Scala/Java > - > > Key: SPARK-6682 > URL: https://issues.apache.org/jira/browse/SPARK-6682 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > In MLlib, we have for some time been unofficially moving away from the old > static train() methods and moving towards builder patterns. This JIRA is to > discuss this move and (hopefully) make it official. > "Old static train()" API: > {code} > val myModel = NaiveBayes.train(myData, ...) > {code} > "New builder pattern" API: > {code} > val nb = new NaiveBayes().setLambda(0.1) > val myModel = nb.train(myData) > {code} > Pros of the builder pattern: > * Much less code when algorithms have many parameters.
Since Java does not > support default arguments, we required *many* duplicated static train() > methods (for each prefix set of arguments). > * Helps to enforce default parameters. Users should ideally not have to even > think about setting parameters if they just want to try an algorithm quickly. > * Matches spark.ml API > Cons of the builder pattern: > * In Python APIs, static train methods are more "Pythonic." > Proposal: > * Scala/Java: We should start deprecating the old static train() methods. We > must keep them for API stability, but deprecating will help with API > consistency, making it clear that everyone should use the builder pattern. > As we deprecate them, we should make sure that the builder pattern supports > all parameters. > * Python: Keep static train methods. > CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
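The two API styles under discussion can be sketched side by side (illustrative Python; `NaiveBayes` here is a stand-in class written for this sketch, not the real MLlib one):

```python
# Sketch contrasting the old static-train style with the builder style.
class NaiveBayesModel:
    def __init__(self, lambda_):
        self.lambda_ = lambda_

class NaiveBayes:
    def __init__(self):
        self._lambda = 1.0  # the default is enforced in exactly one place

    def set_lambda(self, value):
        self._lambda = value
        return self  # returning self is what makes setter calls chainable

    def train(self, data):
        return NaiveBayesModel(self._lambda)

    # The old static style: without default arguments (as in Java), every
    # prefix of the parameter list needs its own overload of this method.
    @staticmethod
    def train_static(data, lambda_=1.0):
        return NaiveBayes().set_lambda(lambda_).train(data)

data = [(0, [1.0]), (1, [0.0])]
model = NaiveBayes().set_lambda(0.1).train(data)  # builder style
print(model.lambda_)  # 0.1
```

Users who just want to try the algorithm call `NaiveBayes().train(data)` and never touch a parameter, which is the "helps enforce defaults" point above.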
[jira] [Resolved] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6506. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5405 [https://github.com/apache/spark/pull/5405] > python support yarn cluster mode requires SPARK_HOME to be set > -- > > Key: SPARK-6506 > URL: https://issues.apache.org/jira/browse/SPARK-6506 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 >Reporter: Thomas Graves > Fix For: 1.3.2, 1.4.0 > > > We added support for python running in yarn cluster mode in > https://issues.apache.org/jira/browse/SPARK-5173, but it requires that > SPARK_HOME be set in the environment variables for the application master and > executor. It doesn't have to be set to anything real, but it fails if it's not > set. See the command at the end of: https://github.com/apache/spark/pull/3976 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
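Until the fix, the workaround amounted to exporting any value at all before submitting (a sketch; `placeholder` is an arbitrary stand-in value, since the description notes the variable does not have to point at anything real):

```shell
# Before the fix, Python apps in YARN cluster mode failed unless SPARK_HOME
# existed in the environment, even though its value was never actually used.
export SPARK_HOME=placeholder   # any value works; it just has to be set
echo "SPARK_HOME=${SPARK_HOME}"
```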
[jira] [Updated] (SPARK-6506) python support yarn cluster mode requires SPARK_HOME to be set
[ https://issues.apache.org/jira/browse/SPARK-6506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6506: -- Assignee: Marcelo Vanzin > python support yarn cluster mode requires SPARK_HOME to be set > -- > > Key: SPARK-6506 > URL: https://issues.apache.org/jira/browse/SPARK-6506 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.3.0 >Reporter: Thomas Graves >Assignee: Marcelo Vanzin > Fix For: 1.3.2, 1.4.0 > > > We added support for python running in yarn cluster mode in > https://issues.apache.org/jira/browse/SPARK-5173, but it requires that > SPARK_HOME be set in the environment variables for application master and > executor. It doesn't have to be set to anything real but it fails if its not > set. See the command at the end of: https://github.com/apache/spark/pull/3976 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6780) Add saveAsTextFileByKey method for PySpark
Ilya Ganelin created SPARK-6780: --- Summary: Add saveAsTextFileByKey method for PySpark Key: SPARK-6780 URL: https://issues.apache.org/jira/browse/SPARK-6780 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Ilya Ganelin The PySpark API should have a method to allow saving a key-value RDD to subdirectories organized by key, as in: https://issues.apache.org/jira/browse/SPARK-3533 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6780) Add saveAsTextFileByKey method for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485577#comment-14485577 ] Ilya Ganelin commented on SPARK-6780: - SPARK-3533 defines matching methods for Scala and Java APIs. > Add saveAsTextFileByKey method for PySpark > -- > > Key: SPARK-6780 > URL: https://issues.apache.org/jira/browse/SPARK-6780 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Ilya Ganelin > > The PySpark API should have a method to allow saving a key-value RDD to > subdirectories organized by key as in : > https://issues.apache.org/jira/browse/SPARK-3533 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6780) Add saveAsTextFileByKey method for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485582#comment-14485582 ] Ilya Ganelin commented on SPARK-6780: - This code was my attempt to implement this within PythonRDD.scala, but I ran into run-time reflection issues I could not solve.
{code}
/**
 * Output a Python RDD of key-value pairs to any Hadoop file system such that the values within
 * the rdd are written to sub-directories organized by the associated key.
 *
 * Keys and values are converted to suitable output types using either user specified converters
 * or, if not specified, [[org.apache.spark.api.python.JavaToWritableConverter]]. Post-conversion
 * types `keyClass` and `valueClass` are automatically inferred if not specified. The passed-in
 * `confAsMap` is merged with the default Hadoop conf associated with the SparkContext of
 * this RDD.
 */
def saveAsHadoopFileByKey[K, V, C <: CompressionCodec](
    pyRDD: JavaRDD[Array[Byte]],
    batchSerialized: Boolean,
    path: String,
    outputFormatClass: String,
    keyClass: String,
    valueClass: String,
    keyConverterClass: String,
    valueConverterClass: String,
    confAsMap: java.util.HashMap[String, String],
    compressionCodecClass: String) = {
  val rdd = SerDeUtil.pythonToPairRDD(pyRDD, batchSerialized)
  val (kc, vc) = getKeyValueTypes(keyClass, valueClass).getOrElse(
    inferKeyValueTypes(rdd, keyConverterClass, valueConverterClass))
  val mergedConf = getMergedConf(confAsMap, pyRDD.context.hadoopConfiguration)
  val codec = Option(compressionCodecClass).map(Utils.classForName(_).asInstanceOf[Class[C]])
  val converted = convertRDD(rdd, keyConverterClass, valueConverterClass,
    new JavaToWritableConverter)
  converted.saveAsHadoopFile(path,
    ClassUtils.primitiveToWrapper(kc),
    ClassUtils.primitiveToWrapper(vc),
    classOf[RDDMultipleTextOutputFormat[K, V]],
    new JobConf(mergedConf),
    codec = codec)
}
{code}
> Add saveAsTextFileByKey method for PySpark > -- > > Key: SPARK-6780 > URL: https://issues.apache.org/jira/browse/SPARK-6780 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Ilya Ganelin > > The PySpark API should have a method to allow saving a key-value RDD to > subdirectories organized by key, as in: > https://issues.apache.org/jira/browse/SPARK-3533 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6780) Add saveAsTextFileByKey method for PySpark
[ https://issues.apache.org/jira/browse/SPARK-6780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485586#comment-14485586 ] Ilya Ganelin commented on SPARK-6780: - Matching test code:
{code}
test("saveAsHadoopFileByKey should generate a text file per key") {
  val testPairs: JavaRDD[Array[Byte]] = sc.parallelize(
    Seq(
      Array(1.toByte, 1.toByte),
      Array(2.toByte, 4.toByte),
      Array(3.toByte, 9.toByte),
      Array(4.toByte, 16.toByte),
      Array(5.toByte, 25.toByte))
  ).toJavaRDD()

  val fs = FileSystem.get(new Configuration())
  val basePath = sc.conf.get("spark.local.dir", "/tmp")
  val fullPath = basePath + "/testPath"
  fs.delete(new Path(fullPath), true)

  PythonRDD.saveAsHadoopFileByKey(
    testPairs,
    false,
    fullPath,
    classOf[RDDMultipleTextOutputFormat].toString,
    classOf[Int].toString,
    classOf[Int].toString,
    null,
    null,
    new java.util.HashMap(),
    "")

  // Test that a file was created for each key
  (1 to 5).foreach(key => {
    val testPath = new Path(fullPath + "/" + key)
    assert(fs.exists(testPath))

    // Read the file and test that the contents are the values matching that key split by line
    val input = fs.open(testPath)
    val reader = new BufferedReader(new InputStreamReader(input))
    val values = new HashSet[Int]
    val lines = Stream.continually(reader.readLine()).takeWhile(_ != null)
    lines.foreach(s => values += s.toInt)
    assert(values.contains(key * key))
  })
  fs.delete(new Path(fullPath), true)
}
{code}
> Add saveAsTextFileByKey method for PySpark > -- > > Key: SPARK-6780 > URL: https://issues.apache.org/jira/browse/SPARK-6780 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Ilya Ganelin > > The PySpark API should have a method to allow saving a key-value RDD to > subdirectories organized by key, as in: > https://issues.apache.org/jira/browse/SPARK-3533 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
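For reference, the behaviour being requested can be sketched in plain Python on the local filesystem (a sketch only; the real method would go through Hadoop output formats as in the Scala attempt above, and `save_as_text_file_by_key` plus the `part-00000` file name are assumptions of this sketch):

```python
import os
import tempfile
from collections import defaultdict

def save_as_text_file_by_key(pairs, base_path):
    """Write each value under a subdirectory named after its key
    (local-filesystem sketch of the behaviour requested for PySpark)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    for key, values in grouped.items():
        key_dir = os.path.join(base_path, str(key))
        os.makedirs(key_dir, exist_ok=True)
        # one part file per key directory, one value per line
        with open(os.path.join(key_dir, "part-00000"), "w") as f:
            f.write("\n".join(str(v) for v in values))

out = tempfile.mkdtemp()
save_as_text_file_by_key([(1, 1), (2, 4), (3, 9)], out)
print(sorted(os.listdir(out)))  # ['1', '2', '3']
```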
[jira] [Updated] (SPARK-6753) Unit test for SPARK-3426 (in ShuffleSuite) doesn't correctly clone the SparkConf
[ https://issues.apache.org/jira/browse/SPARK-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6753: -- Fix Version/s: 1.4.0 1.3.2 1.1.2 1.2.3 > Unit test for SPARK-3426 (in ShuffleSuite) doesn't correctly clone the > SparkConf > > > Key: SPARK-6753 > URL: https://issues.apache.org/jira/browse/SPARK-6753 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.1.1, 1.2.0, 1.3.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 1.1.2, 1.3.2, 1.4.0, 1.2.3 > > > As a result, that test always uses the default shuffle settings, rather than > using the shuffle manager / other settings set by tests that extend > ShuffleSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6753) Unit test for SPARK-3426 (in ShuffleSuite) doesn't correctly clone the SparkConf
[ https://issues.apache.org/jira/browse/SPARK-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6753: -- Affects Version/s: 1.1.1 1.2.0 > Unit test for SPARK-3426 (in ShuffleSuite) doesn't correctly clone the > SparkConf > > > Key: SPARK-6753 > URL: https://issues.apache.org/jira/browse/SPARK-6753 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.1.1, 1.2.0, 1.3.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 1.1.2, 1.3.2, 1.4.0, 1.2.3 > > > As a result, that test always uses the default shuffle settings, rather than > using the shuffle manager / other settings set by tests that extend > ShuffleSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6753) Unit test for SPARK-3426 (in ShuffleSuite) doesn't correctly clone the SparkConf
[ https://issues.apache.org/jira/browse/SPARK-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-6753. --- Resolution: Fixed Fixed by https://github.com/apache/spark/pull/5401 > Unit test for SPARK-3426 (in ShuffleSuite) doesn't correctly clone the > SparkConf > > > Key: SPARK-6753 > URL: https://issues.apache.org/jira/browse/SPARK-6753 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.1.1, 1.2.0, 1.3.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 1.1.2, 1.3.2, 1.4.0, 1.2.3 > > > As a result, that test always uses the default shuffle settings, rather than > using the shuffle manager / other settings set by tests that extend > ShuffleSuite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
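The class of bug behind SPARK-6753 is generic enough to sketch without Spark (plain dicts stand in for SparkConf; the function names are invented for this sketch): a test that rebuilds a default conf instead of cloning the one handed to it silently ignores the settings a subclass suite configured.

```python
# Sketch of the bug class: dicts standing in for SparkConf.
def run_shuffle_test(conf):
    # wrong: ignores the passed-in conf and builds a fresh default one,
    # so the test always runs with default shuffle settings
    default_conf = {"spark.shuffle.manager": "sort"}
    return default_conf["spark.shuffle.manager"]

def run_shuffle_test_fixed(conf):
    # right: clone the caller's conf so its settings are actually exercised
    cloned = dict(conf)
    return cloned["spark.shuffle.manager"]

suite_conf = {"spark.shuffle.manager": "hash"}  # set by a subclass suite
print(run_shuffle_test(suite_conf))        # sort  (bug: setting ignored)
print(run_shuffle_test_fixed(suite_conf))  # hash
```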
[jira] [Commented] (SPARK-6772) spark sql error when running code on large number of records
[ https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485708#comment-14485708 ] Michael Armbrust commented on SPARK-6772: - Can you provide your code? ArrayIndexOutOfBoundsException is often a bug in user code. > spark sql error when running code on large number of records > > > Key: SPARK-6772 > URL: https://issues.apache.org/jira/browse/SPARK-6772 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 1.2.0 >Reporter: Aditya Parmar > > Hi all, > I am getting an ArrayIndexOutOfBoundsException when I try to run a simple > column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine when running > on a file with 2k records. > {code} > 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, > blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) > 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on > executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException > (null) [duplicate 1] > 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, > blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) > 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on > executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException > (null) [duplicate 2] > 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, > blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) > 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on > executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException > (null) [duplicate 3] > 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, > blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) > 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on > executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException > (null)
[duplicate 4] > 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; > aborting job > 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0 > 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled > 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at > JavaSchemaRDD.scala:42, took 1.914477 s > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: > Lost task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com): > java.lang.ArrayIndexOutOfBoundsException > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at 
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [aditya@blrwfl11189 ~]$ > {code} -- This message was sent by Atlassian JIRA (v6.3.4
[jira] [Updated] (SPARK-6772) spark sql error when running code on large number of records
[ https://issues.apache.org/jira/browse/SPARK-6772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6772: Description: Hi all, I am getting an ArrayIndexOutOfBoundsException when I try to run a simple column-filtering query on a file with 2.5 lakh (250,000) records. It runs fine when running on a file with 2k records. {code} 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 3) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 3] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 6, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 5) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 4] 15/04/08 16:54:06 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job 15/04/08 16:54:06 INFO TaskSchedulerImpl: Cancelling stage 0 15/04/08 16:54:06 INFO TaskSchedulerImpl: Stage 0 was cancelled 15/04/08 16:54:06 INFO DAGScheduler: Job 0 failed: saveAsTextFile at JavaSchemaRDD.scala:42, took 1.914477 s Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage
failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, blrwfl11189.igatecorp.com): java.lang.ArrayIndexOutOfBoundsException Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
[aditya@blrwfl11189 ~]$ {code} was: Hi all , I am getting an Arrayoutboundsindex error when i try to run a simple filtering colums query on a file with 2.5 lac records.runs fine when running on a file with 2k records . 15/04/08 16:54:06 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 3, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor blrwfl11189.igatecorp.com: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1] 15/04/08 16:54:06 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 4, blrwfl11189.igatecorp.com, PROCESS_LOCAL, 1350 bytes) 15/04/08 16:54:06 INFO TaskSetManager: Lost task 1.2 in stage 0.0 (TID 4) on exe
[jira] [Commented] (SPARK-6440) ipv6 URI for HttpServer
[ https://issues.apache.org/jira/browse/SPARK-6440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485750#comment-14485750 ] Apache Spark commented on SPARK-6440: - User 'nyaapa' has created a pull request for this issue: https://github.com/apache/spark/pull/5424 > ipv6 URI for HttpServer > --- > > Key: SPARK-6440 > URL: https://issues.apache.org/jira/browse/SPARK-6440 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 > Environment: java 7 hotspot, spark 1.3.0, ipv6 only cluster >Reporter: Arsenii Krasikov >Priority: Minor > > In {{org.apache.spark.HttpServer}} the uri is generated as {code:java}"spark://" > + localHostname + ":" + masterPort{code}, where {{localHostname}} is > {code:java} org.apache.spark.util.Utils.localHostName() = > customHostname.getOrElse(localIpAddressHostname){code}. If the host has an > ipv6 address then it would be interpolated into an invalid URI: > {{spark://fe80:0:0:0:200:f8ff:fe21:67cf:42}} instead of > {{spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42}}. > The solution is to keep the URI and the hostname as separate entities. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
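The shape of the fix is small enough to sketch (illustrative Python, not the Scala patch itself; `to_uri_authority` is a name invented for this sketch): an IPv6 literal must be wrapped in brackets before a port is appended, per RFC 3986, since otherwise its colons are indistinguishable from the host:port separator.

```python
# Sketch of the required behaviour: bracket bare IPv6 literals before
# joining them with a port (RFC 3986 IP-literal syntax).
def to_uri_authority(host, port):
    if ":" in host and not host.startswith("["):  # bare IPv6 literal
        host = "[" + host + "]"
    return "%s:%d" % (host, port)

print("spark://" + to_uri_authority("fe80:0:0:0:200:f8ff:fe21:67cf", 42))
# spark://[fe80:0:0:0:200:f8ff:fe21:67cf]:42
print("spark://" + to_uri_authority("example.com", 42))
# spark://example.com:42
```

Keeping the bare hostname and the URI-ready form as separate values, as the report suggests, avoids double-bracketing when the string is reused.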
[jira] [Created] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
Michael Armbrust created SPARK-6781: --- Summary: sqlCtx -> sqlContext in pyspark shell Key: SPARK-6781 URL: https://issues.apache.org/jira/browse/SPARK-6781 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Davies Liu Priority: Critical We should be consistent across languages in the default names of things we add to the shells. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485757#comment-14485757 ] Michael Armbrust commented on SPARK-6781: - [~pwendell] suggests that we might alias sqlCtx so that both work to avoid a breaking change. We should probably also update documentation to be consistent. > sqlCtx -> sqlContext in pyspark shell > - > > Key: SPARK-6781 > URL: https://issues.apache.org/jira/browse/SPARK-6781 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Critical > > We should be consistent across languages in the default names of things we > add to the shells. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
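The non-breaking option suggested here amounts to binding both names to one object at shell startup (a sketch of the idea only, not the actual pyspark shell startup code; `SQLContext` below is a stand-in class):

```python
# Sketch (not the actual pyspark/shell.py): bind both names to the same
# object so old scripts using sqlCtx keep working while documentation
# standardizes on sqlContext.
class SQLContext:  # stand-in for pyspark.sql.SQLContext
    pass

sqlContext = SQLContext()
sqlCtx = sqlContext  # deprecated alias kept for backwards compatibility

print(sqlCtx is sqlContext)  # True
```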
[jira] [Commented] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485772#comment-14485772 ] Stefano Parmesan commented on SPARK-6677: - Hi Davies, Thanks for taking the time to look into this; that's the expected output, in fact. The point is that back then it happened to me randomly, once every five executions or so. Now I'm not able to reproduce it anymore with the data I posted, but I was able to reproduce it again (100% of the time) with the data I just added to the gist; the exception I'm getting is:
{noformat}
$ ./bin/pyspark ./spark_test.py
[...]
key: 31491
res1 data as row: [Row(foo=31491, key=u'31491')]
res2 data as row: [Row(bar=157455, key=u'31491', other=u'foobar', some=u'thing', that=u'this', this=u'that')]
res1 and res2 fields: (u'foo', u'key') (u'bar', u'key', u'other', u'some', u'that', u'this')
res1 data as tuple: 31491 31491
res2 data as tuple: 157455 31491 foobar
key: 31497
res1 data as row: []
res2 data as row: [Row(foo=157485, key=u'31497')]
key: 31495
res1 data as row: [Traceback (most recent call last):
  File "/path/to/spark-1.3.0-bin-hadoop2.4/./spark_test.py", line 25, in <module>
    print "res1 data as row:", list(res_x)
  File "/path/to/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py", line 1214, in __repr__
    for n in self.__FIELDS__))
  File "/path/to/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py", line 1214, in <genexpr>
    for n in self.__FIELDS__))
IndexError: tuple index out of range
{noformat}
which is the same one I'm getting in my more complex script.
Some considerations:
1) It may be that this particular input leads to the issue only on my machine, so I've added another file to generate random inputs; I got this exception on two out of three randomly-generated samples. Please go ahead and run it with different values of {{N}} if the uploaded data does not make pyspark crash in your environment.
2) Interestingly, given the sample data, it always crashes on the same key: {{31495}}; however, it does not seem "magic" to me (and of course input files containing just those two elements do not make pyspark crash in any way):
{noformat}
data/sample_a.json:{"foo": 31495, "key": "31495"}
data/sample_b.json:{"other": "foobar", "bar": 157475, "key": "31495", "that": "this", "this": "that", "some": "thing"}
{noformat}
3) What happens is that either {{res_x.data\[0\].__FIELDS__}} or {{res_y.data\[0\].__FIELDS__}} gets the wrong field names, leading to the {{IndexError}} (there are too many fields and the row does not contain enough data).
> pyspark.sql nondeterministic issue with row fields
> --
>
> Key: SPARK-6677
> URL: https://issues.apache.org/jira/browse/SPARK-6677
> Project: Spark
> Issue Type: Bug
> Components: PySpark
>Affects Versions: 1.3.0
> Environment: spark version: spark-1.3.0-bin-hadoop2.4
> python version: Python 2.7.6
> operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Stefano Parmesan
> Labels: pyspark, row, sql
>
> The following issue happens only when running pyspark in the python
> interpreter, it works correctly with spark-submit.
> Reading two json files containing objects with a different structure leads
> sometimes to the definition of wrong Rows, where the fields of a file are
> used for the other one.
> I was able to write a sample code that reproduce this issue one out of three > times; the code snippet is available at the following link, together with > some (very simple) data samples: > https://gist.github.com/armisael/e08bb4567d0a11efe2db
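The mechanism described in point 3 of the comment above can be simulated without Spark at all. The sketch below is a hypothetical, simplified stand-in for pyspark's generated Row classes (the real code lives in `pyspark/sql/types.py`): each schema becomes a tuple subclass with a class-level `__FIELDS__`, and if a two-element row ends up carrying the other file's six-field schema, `__repr__` indexes past the end of the tuple and raises exactly the `IndexError: tuple index out of range` shown in the traceback.

```python
# Pure-Python simulation of the reported failure mode: a Row paired with the
# wrong schema's field list crashes in __repr__. This is an illustrative
# stand-in, not pyspark's actual Row implementation.

class Row(tuple):
    __FIELDS__ = ()  # per-schema field names; mixing schemas up causes the crash

    def __repr__(self):
        return "Row(%s)" % ", ".join(
            "%s=%r" % (name, self[i]) for i, name in enumerate(self.__FIELDS__))

class RowA(Row):                      # schema of sample_a.json
    __FIELDS__ = ("foo", "key")

class RowB(Row):                      # schema of sample_b.json
    __FIELDS__ = ("bar", "key", "other", "some", "that", "this")

good = RowA((31495, "31495"))         # two fields, two values: repr works

# Attach the six-field schema to a two-element tuple, as happens when the
# wrong __FIELDS__ is picked up: indexing field 3 of a 2-tuple fails.
bad = RowB((31495, "31495"))
try:
    repr(bad)
    crashed = False
except IndexError:                    # "tuple index out of range"
    crashed = True
```

This matches the symptom in the gist: the data is fine in isolation; only the row/schema pairing is wrong.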
[jira] [Updated] (SPARK-6677) pyspark.sql nondeterministic issue with row fields
[ https://issues.apache.org/jira/browse/SPARK-6677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-6677: Assignee: Davies Liu > pyspark.sql nondeterministic issue with row fields > -- > > Key: SPARK-6677 > URL: https://issues.apache.org/jira/browse/SPARK-6677 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.3.0 > Environment: spark version: spark-1.3.0-bin-hadoop2.4 > python version: Python 2.7.6 > operating system: MacOS, x86_64 x86_64 x86_64 GNU/Linux >Reporter: Stefano Parmesan >Assignee: Davies Liu > Labels: pyspark, row, sql > > The following issue happens only when running pyspark in the python > interpreter, it works correctly with spark-submit. > Reading two json files containing objects with a different structure leads > sometimes to the definition of wrong Rows, where the fields of a file are > used for the other one. > I was able to write a sample code that reproduce this issue one out of three > times; the code snippet is available at the following link, together with > some (very simple) data samples: > https://gist.github.com/armisael/e08bb4567d0a11efe2db -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485780#comment-14485780 ] Zach Fry commented on SPARK-5594: - [~pwendell], We saw this while running a long Spark job (>24 hours) with {{spark.cleaner.ttl}} set to 24 hours; that is, {{spark.cleaner.ttl}} triggered a clean before the job finished. We were able to repro this pretty easily by setting {{spark.cleaner.ttl}} to 60 seconds and running a small job.
{code}
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_2_piece0 of broadcast_2
	at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1003)
	at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
	at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
	at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
	at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
	at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:138)
{code}
We were wondering what the purpose of {{spark.cleaner.ttl}} is. Why does it need to be set to begin with (if at all)? Won't the broadcast variables be cleaned up when the job ends and the context is shut down anyway?
> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below
> when running on a cluster (works locally), and have no idea what is causing
> it, or where to look for more information.
> Any help is appreciated. Others appear to experience the same issue, but I
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all > my other spark jobs work fine. > ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: > Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: > org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of > broadcast_6 > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 > of broadcast_6 > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) > at scala.Option.getOrElse(Option.scala:120) > at > 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) > at > org.apache.spark
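The interaction Zach describes (a TTL-based cleaner evicting broadcast blocks while the job still needs them) can be illustrated with a small Spark-free simulation. The `BroadcastManager` class below is hypothetical, not Spark's actual cleaner code; it only shows why a long-running job plus a short `spark.cleaner.ttl` produces a "Failed to get broadcast" style error.

```python
# Toy model (not Spark's implementation) of TTL-based broadcast cleanup:
# blocks older than the TTL are evicted, so a task in a still-running job
# can no longer fetch them.

class BroadcastManager:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.blocks = {}          # block id -> (created_at, payload)

    def put(self, block_id, payload, now):
        self.blocks[block_id] = (now, payload)

    def clean(self, now):
        """What a spark.cleaner.ttl-style cleaner effectively does."""
        self.blocks = {bid: (t, p) for bid, (t, p) in self.blocks.items()
                       if now - t < self.ttl}

    def fetch(self, block_id, now):
        self.clean(now)
        if block_id not in self.blocks:
            raise IOError("Failed to get %s" % block_id)
        return self.blocks[block_id][1]

mgr = BroadcastManager(ttl_seconds=60)
mgr.put("broadcast_2_piece0", b"jobconf", now=0)

within_ttl = mgr.fetch("broadcast_2_piece0", now=30)   # still present

# The same fetch after the TTL has elapsed, mid-job, fails -- analogous to
# the "Failed to get broadcast_2_piece0 of broadcast_2" exception above.
try:
    mgr.fetch("broadcast_2_piece0", now=120)
    failed = False
except IOError:
    failed = True
```

This is also why the 60-second-TTL repro is so reliable: any job longer than the TTL that reuses a broadcast (e.g. a Hadoop job conf) will hit the eviction window.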
[jira] [Commented] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485788#comment-14485788 ] Davies Liu commented on SPARK-6781: --- We use `sqlCtx` a lot in doc tests (will be in API docs), should we also change them? > sqlCtx -> sqlContext in pyspark shell > - > > Key: SPARK-6781 > URL: https://issues.apache.org/jira/browse/SPARK-6781 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Critical > > We should be consistent across languages in the default names of things we > add to the shells.
[jira] [Commented] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485801#comment-14485801 ] Michael Armbrust commented on SPARK-6781: - Yeah, I think it would be better to be consistent everywhere. > sqlCtx -> sqlContext in pyspark shell > - > > Key: SPARK-6781 > URL: https://issues.apache.org/jira/browse/SPARK-6781 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Critical > > We should be consistent across languages in the default names of things we > add to the shells.
[jira] [Assigned] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6781: --- Assignee: Apache Spark (was: Davies Liu) > sqlCtx -> sqlContext in pyspark shell > - > > Key: SPARK-6781 > URL: https://issues.apache.org/jira/browse/SPARK-6781 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark >Priority: Critical > > We should be consistent across languages in the default names of things we > add to the shells. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6781: --- Assignee: Davies Liu (was: Apache Spark) > sqlCtx -> sqlContext in pyspark shell > - > > Key: SPARK-6781 > URL: https://issues.apache.org/jira/browse/SPARK-6781 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Critical > > We should be consistent across languages in the default names of things we > add to the shells. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6781) sqlCtx -> sqlContext in pyspark shell
[ https://issues.apache.org/jira/browse/SPARK-6781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485803#comment-14485803 ] Apache Spark commented on SPARK-6781: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5425 > sqlCtx -> sqlContext in pyspark shell > - > > Key: SPARK-6781 > URL: https://issues.apache.org/jira/browse/SPARK-6781 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Davies Liu >Priority: Critical > > We should be consistent across languages in the default names of things we > add to the shells. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6757) spark.sql.shuffle.partitions is global, not per connection
[ https://issues.apache.org/jira/browse/SPARK-6757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6757. - Resolution: Duplicate
> spark.sql.shuffle.partitions is global, not per connection
> --
>
> Key: SPARK-6757
> URL: https://issues.apache.org/jira/browse/SPARK-6757
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 1.3.0
>Reporter: David Ross
>
> We are trying to use the {{spark.sql.shuffle.partitions}} parameter to handle
> large queries differently from smaller queries. We expected that this
> parameter would be respected per connection, but it seems to be global.
> For example, try this in two separate JDBC connections:
> Connection 1:
> {code}
> SET spark.sql.shuffle.partitions=10;
> SELECT * FROM some_table;
> {code}
> The correct number {{10}} was used.
> Connection 2:
> {code}
> SET spark.sql.shuffle.partitions=100;
> SELECT * FROM some_table;
> {code}
> The correct number {{100}} was used.
> Back to connection 1:
> {code}
> SELECT * FROM some_table;
> {code}
> We expected the number {{10}} to be used but {{100}} is used.
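The difference between what the reporter observed (one mutable conf shared by all connections, so the last `SET` wins everywhere) and what was expected (per-session values) can be sketched abstractly. The `Session` class below is purely illustrative, not Spark's SQLConf: each session overlays its own `SET`s on top of a shared default, so connection 1's value survives connection 2's `SET`.

```python
# Illustrative model of per-session configuration overlays -- the behavior
# the reporter expected. Names are hypothetical, not Spark's actual classes.

global_conf = {"spark.sql.shuffle.partitions": "200"}   # shared defaults

class Session:
    """A session whose SETs overlay the shared conf instead of mutating it."""
    def __init__(self):
        self.overrides = {}

    def set(self, key, value):
        self.overrides[key] = value          # visible only to this session

    def get(self, key):
        return self.overrides.get(key, global_conf[key])

conn1, conn2 = Session(), Session()
conn1.set("spark.sql.shuffle.partitions", "10")
conn2.set("spark.sql.shuffle.partitions", "100")
# conn1 still sees 10 after conn2's SET; with a single global conf (the
# reported behavior), conn2's SET would have overwritten conn1's value.
```

The bug report amounts to Spark behaving like a single shared `global_conf` mutated in place, rather than like the overlay above.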
[jira] [Updated] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)
[ https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-5594: -- Description: I am uncertain whether this is a bug, however I am getting the error below when running on a cluster (works locally), and have no idea what is causing it, or where to look for more information. Any help is appreciated. Others appear to experience the same issue, but I have not found any solutions online. Please note that this only happens with certain code and is repeatable, all my other spark jobs work fine. {noformat} ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of broadcast_6 at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008) ... 
11 more {noformat} Driver stacktrace: {noformat} at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.fork
[jira] [Created] (SPARK-6782) add sbt-revolver plugin to sbt build
Imran Rashid created SPARK-6782: --- Summary: add sbt-revolver plugin to sbt build Key: SPARK-6782 URL: https://issues.apache.org/jira/browse/SPARK-6782 Project: Spark Issue Type: Improvement Components: Build Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor [sbt-revolver|https://github.com/spray/sbt-revolver] is a very useful sbt plugin for development. You can start & stop long-running processes without being forced to kill the entire sbt session. This can save a lot of time in the development cycle. With sbt-revolver, you run {{re-start}} to start your app in a forked jvm. It immediately gives you the sbt shell back, so you can continue to code. When you want to reload your app with whatever changes you make, you just run {{re-start}} again -- it will kill the forked jvm, recompile your code, and start the process again. (Or you can run {{re-stop}} at any time to kill the forked jvm.) I used this a ton while working on adding json support to the UI in https://issues.apache.org/jira/browse/SPARK-3454 (as the history server never stops, without this plugin I had to kill sbt between every time I'd run it manually to play with the behavior.) I don't write a lot of spark-streaming jobs, but I've also used this plugin in that case, since again my streaming jobs never terminate -- I imagine it would be really useful to anybody that is modifying streaming and wants to test out running some jobs. I'll post a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
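For reference, wiring the plugin into a build is a one-line change in `project/plugins.sbt`. The coordinates follow the sbt-revolver project linked above; the version number shown is illustrative and should be checked against the current release.

```scala
// project/plugins.sbt -- enables sbt-revolver for the build.
// Version is illustrative; check the sbt-revolver README for the
// current release before using it.
addSbtPlugin("io.spray" % "sbt-revolver" % "0.7.2")
```

After reloading sbt, `re-start` forks the app in its own JVM and immediately returns the shell; running `re-start` again kills the forked JVM, recompiles, and relaunches, and `re-stop` kills it at any time.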
[jira] [Updated] (SPARK-6705) MLLIB ML Pipeline's Logistic Regression has no intercept term
[ https://issues.apache.org/jira/browse/SPARK-6705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6705: - Assignee: Omede Firouz > MLLIB ML Pipeline's Logistic Regression has no intercept term > - > > Key: SPARK-6705 > URL: https://issues.apache.org/jira/browse/SPARK-6705 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Omede Firouz >Assignee: Omede Firouz > Fix For: 1.4.0 > > > Currently, the ML Pipeline's LogisticRegression.scala file does not allow > setting whether or not to fit an intercept term. Therefore, the pipeline > defers to LogisticRegressionWithLBFGS which does not use an intercept term. > This makes sense from a performance point of view because adding an intercept > term requires memory allocation. > However, this is undesirable statistically, since the statistical default is > usually to include an intercept term, and one needs to have a very strong > reason for not having an intercept term. > Explicitly modeling the intercept by adding a column of all 1s does not > work because LogisticRegressionWithLBFGS forces column normalization, and a > column of all 1s has 0 variance and so dividing by 0 kills it. > We should open up the API for the ML Pipeline to explicitly allow controlling > whether or not to fit an intercept. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
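The reason the all-1s-column workaround described above fails can be shown with a few lines of plain Python (no MLlib involved): standardizing a feature column divides by its standard deviation, and a constant column has standard deviation 0. The `standardize` helper below is a hypothetical stand-in for the column normalization the description attributes to LogisticRegressionWithLBFGS.

```python
# Why a column of all 1s cannot stand in for an intercept under forced
# column normalization: its standard deviation is 0, so standardizing it
# divides by zero. Illustrative helper, not MLlib's actual scaler.

def standardize(column):
    n = len(column)
    mean = sum(column) / float(n)
    var = sum((x - mean) ** 2 for x in column) / n
    std = var ** 0.5
    return [(x - mean) / std for x in column]   # std == 0 -> ZeroDivisionError

normal = standardize([1.0, 2.0, 3.0])    # an ordinary feature column is fine

try:
    standardize([1.0, 1.0, 1.0])         # the would-be intercept column
    zero_div = False
except ZeroDivisionError:
    zero_div = True
```

(A real scaler might emit NaN/inf instead of raising, but either way the constant column is destroyed, which is why the API needs an explicit fit-intercept option.)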
[jira] [Assigned] (SPARK-6782) add sbt-revolver plugin to sbt build
[ https://issues.apache.org/jira/browse/SPARK-6782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6782: --- Assignee: Apache Spark (was: Imran Rashid) > add sbt-revolver plugin to sbt build > > > Key: SPARK-6782 > URL: https://issues.apache.org/jira/browse/SPARK-6782 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Minor > > [sbt-revolver|https://github.com/spray/sbt-revolver] is a very useful sbt > plugin for development. You can start & stop long-running processes without > being forced to kill the entire sbt session. This can save a lot of time in > the development cycle. > With sbt-revolver, you run {{re-start}} to start your app in a forked jvm. > It immediately gives you the sbt shell back, so you can continue to code. > When you want to reload your app with whatever changes you make, you just run > {{re-start}} again -- it will kill the forked jvm, recompile your code, and > start the process again. (Or you can run {{re-stop}} at any time to kill the > forked jvm.) > I used this a ton while working on adding json support to the UI in > https://issues.apache.org/jira/browse/SPARK-3454 (as the history server never > stops, without this plugin I had to kill sbt between every time I'd run it > manually to play with the behavior.) I don't write a lot of spark-streaming > jobs, but I've also used this plugin in that case, since again my streaming > jobs never terminate -- I imagine it would be really useful to anybody that > is modifying streaming and wants to test out running some jobs. > I'll post a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6782) add sbt-revolver plugin to sbt build
[ https://issues.apache.org/jira/browse/SPARK-6782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6782: --- Assignee: Imran Rashid (was: Apache Spark) > add sbt-revolver plugin to sbt build > > > Key: SPARK-6782 > URL: https://issues.apache.org/jira/browse/SPARK-6782 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > [sbt-revolver|https://github.com/spray/sbt-revolver] is a very useful sbt > plugin for development. You can start & stop long-running processes without > being forced to kill the entire sbt session. This can save a lot of time in > the development cycle. > With sbt-revolver, you run {{re-start}} to start your app in a forked jvm. > It immediately gives you the sbt shell back, so you can continue to code. > When you want to reload your app with whatever changes you make, you just run > {{re-start}} again -- it will kill the forked jvm, recompile your code, and > start the process again. (Or you can run {{re-stop}} at any time to kill the > forked jvm.) > I used this a ton while working on adding json support to the UI in > https://issues.apache.org/jira/browse/SPARK-3454 (as the history server never > stops, without this plugin I had to kill sbt between every time I'd run it > manually to play with the behavior.) I don't write a lot of spark-streaming > jobs, but I've also used this plugin in that case, since again my streaming > jobs never terminate -- I imagine it would be really useful to anybody that > is modifying streaming and wants to test out running some jobs. > I'll post a PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org