[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772224#comment-15772224 ]

luat commented on SPARK-18941:
------------------------------
Hi [~dongjoon],
It is OK, but I think this difference should be documented. For historical reasons, we use 'create table ... location' in all our ETL jobs to create Hive tables (not EXTERNAL tables). I have confirmed that this still works on Spark 1.6.3 (the directory associated with the Hive table is deleted when we drop the table). We will consider changing our ETL jobs so they also work correctly on Spark 2.0. Thanks for your support.

> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the
> directory associated with the Hive table (not EXTERNAL table) from the HDFS
> file system
> -
>
> Key: SPARK-18941
> URL: https://issues.apache.org/jira/browse/SPARK-18941
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.0.2
> Reporter: luat
>
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the
> directory associated with the Hive table (not EXTERNAL table) from the HDFS
> file system.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18987) I am trying to disable Spark Stage progress logs on cluster.
vidit Singh created SPARK-18987:
-----------------------------------
Summary: I am trying to disable Spark Stage progress logs on cluster.
Key: SPARK-18987
URL: https://issues.apache.org/jira/browse/SPARK-18987
Project: Spark
Issue Type: Question
Components: Input/Output
Environment: Hadoop Cluster
Reporter: vidit Singh

I am trying to disable the Spark stage progress output on the cluster, as I don't want to see those stage lines. I have tried everything I can think of, but the problem persists. I have tried the following:

Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);

I even modified log4j with log4j.logger.org.apache.spark=OFF, and tried passing spark.ui.showConsoleProgress=false to spark-submit. Passing an external log4j.properties along with spark-submit (--files log4j.properties) didn't work for me either.

Output:
[Stage 8:>(0 + 2) / 2][Stage 9:>(0 + 2) / 2][Stage 11:> (0 + 2) / 2]
[Stage 8:>(0 + 2) / 2][Stage 9:>(0 + 2) / 2][Stage 13:> (0 + 2) / 2]
[Stage 8:>(0 + 2) / 2][Stage 13:> (0 + 2) / 2][Stage 15:> (0 + 2) / 2]
[Stage 8:>(0 + 2) / 2][Stage 15:> (0 + 2) / 2][Stage 20:==> (1 + 1) / 2]
[Stage 8:> (0 + 2) / 2][Stage 15:> (0 + 2) / 2]
[Stage 8:=> (1 + 1) / 2][Stage 15:=>(1 + 1) / 2]
[Stage 15:=>(1 + 1) / 2]
[Stage 10:> (0 + 180) / 200]
[Stage 10:> (0 + 180) / 200][Stage 17:> (0 + 0) / 200]
[Stage 10:> (2 + 180) / 200][Stage 17:> (0 + 0) / 200]
[Stage 10:> (5 + 180) / 200][Stage 17:> (0 + 0) / 200]
[Stage 10:>(14 + 180) / 200][Stage 17:> (0 + 0) / 200]

I don't want to see these stages in my logs. Please help.
[jira] [Updated] (SPARK-18974) FileInputDStream could not detect files moved into the directory
[ https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Wang updated SPARK-18974:
------------------------------
Description:
FileInputDStream uses modification time to find new files, but if a file is moved into the directory its modification time is not changed, so FileInputDStream cannot detect these files. I think one way to fix this bug is to check access_time instead, but that needs a Set of files to track all previously seen files, which would be very inefficient for directories with many files.

was:
FileInputDStream use mod time to find new files, but if a file was moved into the directories it's modification time would not be changed, so FileInputDStream could not detected these files. I think a way to fix this bug is get access_time and do judgment, bug it need a Set of files to save all old files, it would very inefficient for lot of files directory.

> FileInputDStream could not detect files moved into the directory
> ---
>
> Key: SPARK-18974
> URL: https://issues.apache.org/jira/browse/SPARK-18974
> Project: Spark
> Issue Type: Bug
> Components: DStreams, Input/Output
> Affects Versions: 1.6.3, 2.0.2
> Reporter: Adam Wang
>
> FileInputDStream uses modification time to find new files, but if a file is
> moved into the directory its modification time is not changed, so
> FileInputDStream cannot detect these files.
> I think one way to fix this bug is to check access_time instead, but that
> needs a Set of files to track all previously seen files, which would be very
> inefficient for directories with many files.
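The mtime behaviour described above is easy to see outside Spark. A minimal sketch (plain Python, nothing Spark-specific; directory and file names are made up) showing that moving a file into a watched directory preserves its modification time, which is why a scan that only accepts files newer than the last scan time misses it:

```python
import os
import tempfile
import time

watched = tempfile.mkdtemp(prefix="watched-")
staging = tempfile.mkdtemp(prefix="staging-")

# Create a file in a staging area and remember its modification time.
src = os.path.join(staging, "data.txt")
with open(src, "w") as f:
    f.write("some records\n")
mtime_before = os.path.getmtime(src)

time.sleep(1.1)  # ensure a *new* write would get a visibly later mtime

# Moving (renaming) the file into the watched directory does not rewrite it,
# so its modification time is unchanged...
dst = os.path.join(watched, "data.txt")
os.rename(src, dst)
assert os.path.getmtime(dst) == mtime_before

# ...so a scanner that only accepts files with mtime newer than its last scan
# (roughly what a mod-time-based FileInputDStream does) never picks it up.
last_scan = time.time()
new_files = [n for n in os.listdir(watched)
             if os.path.getmtime(os.path.join(watched, n)) > last_scan]
```

Note this assumes the move stays on one filesystem; a cross-filesystem "move" is really a copy and does refresh the mtime.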
[jira] [Commented] (SPARK-18978) Spark streaming ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-18978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772185#comment-15772185 ]

Liang-Chi Hsieh commented on SPARK-18978:
-----------------------------------------
Actually, I can't reproduce this on the master branch with spark-shell. Which specific Spark version are you using?

> Spark streaming ClassCastException
> --
>
> Key: SPARK-18978
> URL: https://issues.apache.org/jira/browse/SPARK-18978
> Project: Spark
> Issue Type: Bug
> Reporter: Keltoum BOUQOUROU
>
> I use Spark Streaming as a listener to monitor a directory. When a new file
> is detected, the program performs processing on the file. The program is
> the following:
> val conf = new SparkConf().setAppName("DocumentRanking").setMaster("local[*]")
> val sparkStreamingContext = new StreamingContext(conf, Seconds(5))
> val directoryStream = sparkStreamingContext.textFileStream("nom_dossier")
> directoryStream.foreachRDD(rdd => if (rdd.count() != 0)
>   rdd.foreach(line => traitement()))
> The processing of the first file added passes without problems. But when the
> second file is added, I get the following error:
> 16/12/22 09:11:37 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration
> cannot be cast to [B
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
[jira] [Updated] (SPARK-18974) FileInputDStream could not detect files moved into the directory
[ https://issues.apache.org/jira/browse/SPARK-18974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Wang updated SPARK-18974:
------------------------------
Description:
FileInputDStream uses modification time to find new files, but if a file is moved into the directory its modification time is not changed, so FileInputDStream cannot detect these files. I think one way to fix this bug is to check access_time instead, but that needs a Set of files to track all previously seen files, which would be very inefficient for directories with many files.

was:
FileInputDStream use mod time to find new files, but if a file was moved into the directories it's modification time would not changed, so FileInputDStream could not detected these files. I think a way to fix this bug is get access_time and do judgment, bug it need a Set of files to save all old files, it would very inefficient for lot of files directory.

> FileInputDStream could not detect files moved into the directory
> ---
>
> Key: SPARK-18974
> URL: https://issues.apache.org/jira/browse/SPARK-18974
> Project: Spark
> Issue Type: Bug
> Components: DStreams, Input/Output
> Affects Versions: 1.6.3, 2.0.2
> Reporter: Adam Wang
>
> FileInputDStream uses modification time to find new files, but if a file is
> moved into the directory its modification time is not changed, so
> FileInputDStream cannot detect these files.
> I think one way to fix this bug is to check access_time instead, but that
> needs a Set of files to track all previously seen files, which would be very
> inefficient for directories with many files.
[jira] [Assigned] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
[ https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18986:
------------------------------------
Assignee: Apache Spark

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Liang-Chi Hsieh
> Assignee: Apache Spark
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the
> iterator in the map is not null. However, the assertion only holds after the
> map has been asked for its iterator. Before that, if another memory consumer
> asks for more memory than is currently available,
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see
> a failure like this:
> {code}
> [info] java.lang.AssertionError: assertion failed
> [info] at scala.Predef$.assert(Predef.scala:156)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info] at
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
> {code}
[jira] [Commented] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
[ https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772013#comment-15772013 ]

Apache Spark commented on SPARK-18986:
--------------------------------------
User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/16387

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the
> iterator in the map is not null. However, the assertion only holds after the
> map has been asked for its iterator. Before that, if another memory consumer
> asks for more memory than is currently available,
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see
> a failure like this:
> {code}
> [info] java.lang.AssertionError: assertion failed
> [info] at scala.Predef$.assert(Predef.scala:156)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info] at
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
> {code}
[jira] [Commented] (SPARK-18199) Support appending to Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15772014#comment-15772014 ]

Takeshi Yamamuro commented on SPARK-18199:
------------------------------------------
Have you checked https://github.com/apache/spark/pull/16281? The Parquet community will release v1.8.2 with backports, and Spark is planning to upgrade to v1.8.2. So, IIUC, Spark has no plan to upgrade to v1.9.0 for now.

> Support appending to Parquet files
> --
>
> Key: SPARK-18199
> URL: https://issues.apache.org/jira/browse/SPARK-18199
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Jeremy Smith
>
> Currently, appending to a Parquet directory involves simply creating new
> parquet files in the directory. With many small appends (for example, in a
> streaming job with a short batch duration) this leads to an unbounded number
> of small Parquet files accumulating. These must be cleaned up with some
> frequency by removing them all and rewriting a new file containing all the
> rows.
> It would be far better if Spark supported appending to the Parquet files
> themselves. HDFS supports this, as does Parquet:
> * The Parquet footer can be read in order to obtain necessary metadata.
> * The new rows can then be appended to the Parquet file as a row group.
> * A new footer can then be appended containing the metadata and referencing
> the new row groups as well as the previously existing row groups.
> This would result in a small amount of bloat in the file as new row groups
> are added (since duplicate metadata would accumulate), but it's hugely
> preferable to accumulating small files, which is bad for HDFS health and also
> eventually leads to Spark being unable to read the Parquet directory at all.
> Periodic rewriting of the file could still be performed in order to remove
> the duplicate metadata.
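The append scheme sketched in the description (read the footer, append a row group, write a new footer covering old and new groups) can be illustrated with a toy file format. This is not Parquet's actual layout or API, just a hedged sketch of the footer-at-the-end idea using JSON metadata and a fixed-size footer length:

```python
import json
import os
import struct
import tempfile

# Toy layout: [block bytes ...][footer JSON][8-byte little-endian footer length].
# Appending means: locate and strip the old footer, write the new block where
# the footer used to start, then write a footer that lists all blocks.

def _write_footer(f, footer):
    raw = json.dumps(footer).encode()
    f.write(raw)
    f.write(struct.pack("<Q", len(raw)))

def append_block(path, payload):
    """Append one data block, rewriting the trailing footer."""
    if os.path.exists(path):
        with open(path, "r+b") as f:
            f.seek(-8, os.SEEK_END)
            (length,) = struct.unpack("<Q", f.read(8))
            f.seek(-(8 + length), os.SEEK_END)
            footer_pos = f.tell()
            footer = json.loads(f.read(length))
            f.seek(footer_pos)
            f.truncate()  # drop the old footer; new data starts here
            footer["blocks"].append({"offset": footer_pos, "len": len(payload)})
            f.write(payload)
            _write_footer(f, footer)
    else:
        with open(path, "wb") as f:
            f.write(payload)
            _write_footer(f, {"blocks": [{"offset": 0, "len": len(payload)}]})

def read_blocks(path):
    """Read the footer from the end of the file, then fetch every block."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (length,) = struct.unpack("<Q", f.read(8))
        f.seek(-(8 + length), os.SEEK_END)
        footer = json.loads(f.read(length))
        out = []
        for b in footer["blocks"]:
            f.seek(b["offset"])
            out.append(f.read(b["len"]))
    return out

path = os.path.join(tempfile.mkdtemp(), "toy.dat")
append_block(path, b"row-group-1")
append_block(path, b"row-group-2")
blocks = read_blocks(path)
```

As in the proposal above, each append carries its own copy of the metadata forward, so the footer grows slightly with every append; a periodic rewrite would compact it.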
[jira] [Assigned] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
[ https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-18986:
------------------------------------
Assignee: (was: Apache Spark)

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the
> iterator in the map is not null. However, the assertion only holds after the
> map has been asked for its iterator. Before that, if another memory consumer
> asks for more memory than is currently available,
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see
> a failure like this:
> {code}
> [info] java.lang.AssertionError: assertion failed
> [info] at scala.Predef$.assert(Predef.scala:156)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info] at
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
> {code}
[jira] [Updated] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
[ https://issues.apache.org/jira/browse/SPARK-18986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liang-Chi Hsieh updated SPARK-18986:
------------------------------------
Component/s: (was: SQL)
Spark Core

> ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its
> iterator
> -
>
> Key: SPARK-18986
> URL: https://issues.apache.org/jira/browse/SPARK-18986
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Liang-Chi Hsieh
>
> {{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the
> iterator in the map is not null. However, the assertion only holds after the
> map has been asked for its iterator. Before that, if another memory consumer
> asks for more memory than is currently available,
> {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see
> a failure like this:
> {code}
> [info] java.lang.AssertionError: assertion failed
> [info] at scala.Predef$.assert(Predef.scala:156)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
> [info] at
> org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
> [info] at
> org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
> {code}
[jira] [Created] (SPARK-18986) ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
Liang-Chi Hsieh created SPARK-18986:
---------------------------------------
Summary: ExternalAppendOnlyMap shouldn't fail when forced to spill before calling its iterator
Key: SPARK-18986
URL: https://issues.apache.org/jira/browse/SPARK-18986
Project: Spark
Issue Type: Bug
Components: SQL
Reporter: Liang-Chi Hsieh

{{ExternalAppendOnlyMap.forceSpill}} now uses an assert to check that the iterator in the map is not null. However, the assertion only holds after the map has been asked for its iterator. Before that, if another memory consumer asks for more memory than is currently available, {{ExternalAppendOnlyMap.forceSpill}} can also be called. In this case, we see a failure like this:
{code}
[info] java.lang.AssertionError: assertion failed
[info] at scala.Predef$.assert(Predef.scala:156)
[info] at org.apache.spark.util.collection.ExternalAppendOnlyMap.forceSpill(ExternalAppendOnlyMap.scala:196)
[info] at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:111)
[info] at org.apache.spark.util.collection.ExternalAppendOnlyMapSuite$$anonfun$13.apply$mcV$sp(ExternalAppendOnlyMapSuite.scala:294)
{code}
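The failure mode above (an assert that only holds after the iterator has been created) reduces to a small analogue. This is not Spark's actual ExternalAppendOnlyMap, just a hedged Python sketch of the general fix direction: replace the assert with a guard so that a spill forced before iteration degrades gracefully instead of crashing:

```python
class SpillableMap:
    """Tiny analogue of a map that a memory manager may force to spill
    before anyone has asked for its iterator."""

    def __init__(self):
        self._map = {"k": 1}         # in-memory data
        self._read_iterator = None   # created lazily, as in the report

    def iterator(self):
        self._read_iterator = iter(self._map.items())
        return self._read_iterator

    def force_spill_buggy(self):
        # Mirrors the reported behaviour: assumes iteration already started,
        # so an early forced spill trips the assertion.
        assert self._read_iterator is not None
        return True

    def force_spill_fixed(self):
        # Guard instead of assert: if iteration hasn't started, spill the
        # in-memory map directly and report whether anything was freed.
        if self._read_iterator is None:
            spilled = len(self._map) > 0
            self._map.clear()  # stand-in for writing the contents to disk
            return spilled
        return True

m = SpillableMap()
try:
    m.force_spill_buggy()
    assertion_fired = False
except AssertionError:
    assertion_fired = True  # spill before iteration hits the assert

m2 = SpillableMap()
early_spill_ok = m2.force_spill_fixed()  # early spill now succeeds
```

The guarded variant is only a sketch of the idea; the actual change for this ticket is in the linked pull request.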
[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771773#comment-15771773 ]

David Rosenstrauch commented on SPARK-9686:
-------------------------------------------
Sorry, this actually wasn't the issue I was having. Apologies for the noise.

> Spark Thrift server doesn't return correct JDBC metadata
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
> Reporter: pin_zhang
> Assignee: Cheng Lian
> Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. show tables; the newly created table is returned
> 5.
> Class.forName("org.apache.hive.jdbc.HiveDriver");
> String URL = "jdbc:hive2://localhost:1/default";
> Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
> ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>     null, null, null);
> Problem:
> No tables are returned by this API; this worked in Spark 1.3.
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated SPARK-18981:
-------------------------------
Description:
CONF:
spark.speculation true
spark.dynamicAllocation.minExecutors 0
spark.executor.cores 2

When I run the following app, the bug triggers:
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```
The triggering conditions are as follows:
condition1: During the sleep, the executors are released and the number of executors drops to zero a few seconds later. #numExecutorsTarget in 'ExecutorAllocationManager' becomes 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes negative while job1's tasks are finishing.
condition3: job2 has only one task.
result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager' never requests containers from YARN and the app hangs.
In the attachments, submitting the app with 'run_scala.sh' reproduces the hang, as 'job_hang.png' shows.

was:
CONF:
spark.speculation true
spark.dynamicAllocation.minExecutors 0
spark.executor.cores 4
When I run the follow app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```
The triggering condition is described as follows:
condition1: During the sleeping time, the executors will be released and the # of the executor will be zero some seconds later. The #numExecutorsTarget in 'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks will be negative during the ending of job1's tasks.
condition3: The job2 only hava one task.
result: In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or negative. So the 'ExecutorAllocationManager' will not request container from yarn. The app will hang. In the attachments, submitting the app by 'run_scala.sh' will lead to the 'hang' problem as the 'job_hang.png' shows.

> The last job hung when speculation is on
> -
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
> Reporter: roncenzhao
> Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
> CONF:
> spark.speculation true
> spark.dynamicAllocation.minExecutors 0
> spark.executor.cores 2
> When I run the following app, the bug triggers:
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition1: During the sleep, the executors are released and the number of
> executors drops to zero a few seconds later. #numExecutorsTarget in
> 'ExecutorAllocationManager' becomes 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes
> negative while job1's tasks are finishing.
> condition3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate
> #maxNeeded in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is
> negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager'
> never requests containers from YARN and the app hangs.
> In the attachments, submitting the app with 'run_scala.sh' reproduces the
> hang, as 'job_hang.png' shows.
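The arithmetic in the "result" section can be checked directly. Below is a simplified stand-in for the executor-count computation (the real method lives in Spark's ExecutorAllocationManager; the ceiling division here is only a sketch of that logic) showing how a negative running-task count suppresses the executor request:

```python
def max_executors_needed(num_running_or_pending_tasks, tasks_per_executor):
    # Ceiling division over the task count, roughly
    # (tasks + tasksPerExecutor - 1) / tasksPerExecutor.
    return (num_running_or_pending_tasks + tasks_per_executor - 1) // tasks_per_executor

# Healthy case: one pending task with 2 cores per executor needs 1 executor.
healthy = max_executors_needed(1, 2)

# Bug scenario from the report: numRunningTasks has drifted to, say, -2
# before job2 submits its single task, so the total is -2 + 1 = -1 ...
needed = max_executors_needed(-1, 2)
# ... and the computed target is <= 0, so no executors are ever requested
# and the one-task job2 hangs.
```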
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated SPARK-18981:
-------------------------------
Description:
CONF:
spark.speculation true
spark.dynamicAllocation.minExecutors 0
spark.executor.cores 4

When I run the following app, the bug triggers:
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```
The triggering conditions are as follows:
condition1: During the sleep, the executors are released and the number of executors drops to zero a few seconds later. #numExecutorsTarget in 'ExecutorAllocationManager' becomes 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes negative while job1's tasks are finishing.
condition3: job2 has only one task.
result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager' never requests containers from YARN and the app hangs.
In the attachments, submitting the app with 'run_scala.sh' reproduces the hang, as 'job_hang.png' shows.

was:
CONF:
spark.speculation true
spark.dynamicAllocation.minExecutors 0
spark.executor.cores 4
When I run the follow app, the bug will trigger.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // the job2 will hang and never be scheduled
```
The triggering condition is described as follows:
condition1: During the sleeping time, the executors will be released and the # of the executor will be zero some seconds later. The #numExecutorsTarget in 'ExecutorAllocationManager' will be 0.
condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks will be negative during the ending of job1's tasks.
condition3: The job2 only hava one task.
result: In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or negative. So the 'ExecutorAllocationManager' will not request container from yarn. The app will hang.

> The last job hung when speculation is on
> -
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
> Reporter: roncenzhao
> Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
> CONF:
> spark.speculation true
> spark.dynamicAllocation.minExecutors 0
> spark.executor.cores 4
> When I run the following app, the bug triggers:
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition1: During the sleep, the executors are released and the number of
> executors drops to zero a few seconds later. #numExecutorsTarget in
> 'ExecutorAllocationManager' becomes 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes
> negative while job1's tasks are finishing.
> condition3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate
> #maxNeeded in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is
> negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager'
> never requests containers from YARN and the app hangs.
> In the attachments, submitting the app with 'run_scala.sh' reproduces the
> hang, as 'job_hang.png' shows.
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

roncenzhao updated SPARK-18981:
-------------------------------
Attachment: run_scala.sh

> The last job hung when speculation is on
> -
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.2
> Environment: spark2.0.2
> hadoop2.5.0
> Reporter: roncenzhao
> Priority: Critical
> Attachments: Test.scala, job_hang.png, run_scala.sh
>
> CONF:
> spark.speculation true
> spark.dynamicAllocation.minExecutors 0
> spark.executor.cores 4
> When I run the following app, the bug triggers:
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition1: During the sleep, the executors are released and the number of
> executors drops to zero a few seconds later. #numExecutorsTarget in
> 'ExecutorAllocationManager' becomes 0.
> condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes
> negative while job1's tasks are finishing.
> condition3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()' we calculate
> #maxNeeded in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is
> negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager'
> never requests containers from YARN and the app hangs.
[jira] [Commented] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data
[ https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771706#comment-15771706 ]

Hyukjin Kwon commented on SPARK-18984:
--------------------------------------
[~tonythor], it seems you reproduced this on Spark 2.0.0, not 2.0.2, because the line numbers in the trace differ slightly from 2.0.2. The problematic code appears to be this line in Spark 2.0.0 in your case:
https://github.com/apache/spark/blob/13650fc58e1fcf2cf2a26ba11c819185ae1acc1f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L146

> Concat with ds.write.text() throws exception if column contains null data
> -
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: Spark 2.0.2, Scala 2.11.8
> Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""), col("device"), lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
> )).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV, or Parquet, and the above code
> works. But as soon as you take out "and _fd_cast is not null", it throws the
> exception below. Using either nullValue or treatEmptyValuesAsNulls, on either
> read or write, doesn't seem to matter.
> Exception is: > 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches > in 5 ms > 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter > 16/12/22 14:16:18 ERROR Utils: Aborting task > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt > attempt_201612221416_0002_m_00_0 aborted. 
> 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at >
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] roncenzhao updated SPARK-18981: --- Attachment: Test.scala
> The last job hung when speculation is on
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.2
> Environment: spark 2.0.2
> hadoop 2.5.0
> Reporter: roncenzhao
> Priority: Critical
> Attachments: Test.scala, job_hang.png
>
> CONF:
> spark.speculation true
> spark.dynamicAllocation.minExecutors 0
> spark.executor.cores 4
> When I run the following app, the bug triggers.
> ```
> sc.runJob(job1)
> sleep(100s)
> sc.runJob(job2) // job2 will hang and never be scheduled
> ```
> The triggering conditions are as follows:
> condition 1: During the sleep, the executors are released and the number of executors drops to zero a few seconds later. #numExecutorsTarget in 'ExecutorAllocationManager' will be 0.
> condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes negative while job1's tasks are ending.
> condition 3: job2 has only one task.
> result:
> In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded is calculated in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager' never requests containers from YARN and the app hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data
[ https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771695#comment-15771695 ] Hyukjin Kwon edited comment on SPARK-18984 at 12/23/16 2:50 AM: Yes, I can reproduce the same error with the codes as in Spark 2.0.2 below: {code} scala> Seq(Some("a"), None).toDF.write.text("/tmp/abc") java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:148) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) {code} What we could do with this JIRA might be one (or some) of these as below: - resolve it as this seems fixed in the master anyway. - add a test case for this because It seems there is no test case for this. - introduce {{nullValue}} option because It seems we can't read {{null}} back via Text datasource as below. 
{code}
scala> Seq(Some("a"), None).toDF.show()
+-----+
|value|
+-----+
|    a|
| null|
+-----+

scala> spark.read.text("/tmp/abc").show()
+-----+
|value|
+-----+
|    a|
|     |
+-----+
{code}
> Concat with ds.write.text() throw exception if column contains null data
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: spark 2.0.2, scala 2.11.8
> Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""), col("device"), lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
[jira] [Commented] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data
[ https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771695#comment-15771695 ] Hyukjin Kwon commented on SPARK-18984: -- Yes, I can reproduce the same error with the code below:
{code}
scala> Seq(Some("a"), None).toDF.write.text("/tmp/abc")
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:148)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252)
  at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1348)
  at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
  at org.apache.spark.scheduler.Task.run(Task.scala:86)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
{code}
What we could do with this JIRA might be one (or some) of the following:
- resolve it, as this seems fixed in master anyway.
- add a test case, because it seems there is no test case for this.
- introduce a {{nullValue}} option, because it seems we can't read {{null}} back via the Text datasource, as below.
{code}
scala> Seq(Some("a"), None).toDF.show()
+-----+
|value|
+-----+
|    a|
| null|
+-----+

scala> spark.read.text("/tmp/abc").show()
+-----+
|value|
+-----+
|    a|
|     |
+-----+
{code}
> Concat with ds.write.text() throw exception if column contains null data
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: spark 2.0.2, scala 2.11.8
> Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""), col("device"), lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV or Parquet, and the code above works. As soon as you take out "and _fd_cast is not null", though, it throws the exception below. Using either nullValue or treatEmptyValuesAsNulls, whether reading in or writing out, doesn't seem to matter.
> Exception is: > 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches > in 5 ms > 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter > 16/12/22 14:16:18 ERROR Utils: Aborting task > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at
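Until a fix is available on a released branch, a user-side workaround is to remove the nulls before handing the column to the text writer. This is only a sketch, not part of the reporter's pipeline; the {{line}} alias and the {{"NULL"}} sentinel value are illustrative choices:

```scala
// Hypothetical workaround sketch, assuming the same `outbound` DataFrame as
// in the report. Alias the concatenated column, then either fill nulls with
// a sentinel string or drop the null rows entirely before write.text().
import org.apache.spark.sql.functions.{concat, lit}

val out = outbound
  .select(concat(outbound.col("device_id"), lit("\t")).as("line"))
  .na.fill("NULL", Seq("line"))      // or: .filter("line is not null")

out.write.mode("overwrite").text("/filepath")
```

Either variant avoids ever passing a null row to TextOutputWriter, which is where the NullPointerException originates.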
[jira] [Assigned] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18352: Assignee: (was: Apache Spark) > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18352: Assignee: Apache Spark > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > Labels: releasenotes > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18352) Parse normal, multi-line JSON files (not just JSON Lines)
[ https://issues.apache.org/jira/browse/SPARK-18352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771689#comment-15771689 ] Apache Spark commented on SPARK-18352: -- User 'NathanHowell' has created a pull request for this issue: https://github.com/apache/spark/pull/16386 > Parse normal, multi-line JSON files (not just JSON Lines) > - > > Key: SPARK-18352 > URL: https://issues.apache.org/jira/browse/SPARK-18352 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin > Labels: releasenotes > > Spark currently can only parse JSON files that are JSON lines, i.e. each > record has an entire line and records are separated by new line. In reality, > a lot of users want to use Spark to parse actual JSON files, and are > surprised to learn that it doesn't do that. > We can introduce a new mode (wholeJsonFile?) in which we don't split the > files, and rather stream through them to parse the JSON files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
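If the proposed mode lands, usage might look like the sketch below. The option name ({{wholeJsonFile}}, taken from the issue description) is tentative and may well differ in the final pull request:

```scala
// Tentative sketch only: the option name follows the JIRA description
// ("wholeJsonFile") and is not final. With the option set, files are not
// split line-by-line; each file is streamed through the JSON parser whole.
val df = spark.read
  .option("wholeJsonFile", "true")
  .json("/data/pretty-printed.json")
```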
[jira] [Comment Edited] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data
[ https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771675#comment-15771675 ] Hyukjin Kwon edited comment on SPARK-18984 at 12/23/16 2:41 AM: It seems this is fixed together with SPARK-18658 in master. Null-check logic was introduced there - {{if (!row.isNullAt(0)) ..}} - but it was only merged into master. The code I used to reproduce this was:
{code}
Seq(Some("a"), None).toDF.write.text("/tmp/abc")
{code}
Could you please confirm whether this is fine in master? BTW, I think {{nullValue}} is a CSV-datasource-specific option.
> Concat with ds.write.text() throw exception if column contains null data
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: spark 2.0.2, scala 2.11.8
> Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""), col("device"), lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV or Parquet, and the code above works. As soon as you take out "and _fd_cast is not null", though, it throws the exception below. Using either nullValue or treatEmptyValuesAsNulls, whether reading in or writing out, doesn't seem to matter.
> Exception is: > 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches > in 5 ms > 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter > 16/12/22 14:16:18 ERROR Utils: Aborting task > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt > attempt_201612221416_0002_m_00_0 aborted. 
> 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at
[jira] [Commented] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data
[ https://issues.apache.org/jira/browse/SPARK-18984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771675#comment-15771675 ] Hyukjin Kwon commented on SPARK-18984: -- It seems this is fixed together with SPARK-18658 in master. Null-check logic was introduced there - {{if (!row.isNullAt(0)) ..}} - but it was only merged into master. The code I used to reproduce this was:
{code}
Seq(Some("a"), None).toDF.write.text("/tmp/abc")
{code}
Could you please confirm whether this is fine in master? BTW, I think {{nullValue}} is a CSV-datasource-specific option.
> Concat with ds.write.text() throw exception if column contains null data
>
> Key: SPARK-18984
> URL: https://issues.apache.org/jira/browse/SPARK-18984
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2
> Environment: spark 2.0.2, scala 2.11.8
> Reporter: Tony Fraser
>
> val customOutputFormat = outbound.select(concat(
>   outbound.col("device_id"), lit("\t"),
>   lit("\"device\"=\""), col("device"), lit("\","),
>   lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
> ).where("device_type='ios_app' and _fd_cast is not null")
> customOutputFormat
>   .limit(1000)
>   .write
>   .option("nullValue", "NULL")
>   .mode("overwrite")
>   .text("/filepath")
> There is no problem writing to JSON, CSV or Parquet, and the code above works. As soon as you take out "and _fd_cast is not null", though, it throws the exception below. Using either nullValue or treatEmptyValuesAsNulls, whether reading in or writing out, doesn't seem to matter.
> Exception is: > 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches > in 5 ms > 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter > 16/12/22 14:16:18 ERROR Utils: Aborting task > java.lang.NullPointerException > at > org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt > attempt_201612221416_0002_m_00_0 aborted. 
> 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at >
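The null check mentioned in the comment above amounts to guarding the UTF8String access in the text writer. A simplified sketch of what that guard looks like (illustrative only, not the exact contents of TextFileFormat.scala in master):

```scala
// Simplified sketch of the shape of the SPARK-18658 fix in
// TextOutputWriter.writeInternal: skip a null first column instead of
// dereferencing it. `writer` stands in for the underlying record output.
override protected[sql] def writeInternal(row: InternalRow): Unit = {
  if (!row.isNullAt(0)) {          // the null check introduced in master
    val utf8string = row.getUTF8String(0)
    utf8string.writeTo(writer)
  }
  // a row is still emitted, so a null becomes an empty line on read-back,
  // which is why a nullValue option was suggested for round-tripping
}
```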
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] roncenzhao updated SPARK-18981: --- Attachment: job_hang.png > The last job hung when speculation is on > > > Key: SPARK-18981 > URL: https://issues.apache.org/jira/browse/SPARK-18981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: spark2.0.2 > hadoop2.5.0 >Reporter: roncenzhao >Priority: Critical > Attachments: job_hang.png > > > CONF: > spark.speculation true > spark.dynamicAllocation.minExecutors0 > spark.executor.cores 4 > When I run the follow app, the bug will trigger. > ``` > sc.runJob(job1) > sleep(100s) > sc.runJob(job2) // the job2 will hang and never be scheduled > ``` > The triggering condition is described as follows: > condition1: During the sleeping time, the executors will be released and the > # of the executor will be zero some seconds later. The #numExecutorsTarget in > 'ExecutorAllocationManager' will be 0. > condition2: In 'ExecutorAllocationListener.onTaskEnd()', the numRunningTasks > will be negative during the ending of job1's tasks. > condition3: The job2 only hava one task. > result: > In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', > we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, > #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or > negative. So the 'ExecutorAllocationManager' will not request container from > yarn. The app will hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] roncenzhao updated SPARK-18981: --- Description:
CONF:
spark.speculation true
spark.dynamicAllocation.minExecutors 0
spark.executor.cores 4
When I run the following app, the bug triggers.
```
sc.runJob(job1)
sleep(100s)
sc.runJob(job2) // job2 will hang and never be scheduled
```
The triggering conditions are as follows:
condition 1: During the sleep, the executors are released and the number of executors drops to zero a few seconds later. #numExecutorsTarget in 'ExecutorAllocationManager' will be 0.
condition 2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks goes negative while job1's tasks are ending.
condition 3: job2 has only one task.
result:
In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded is calculated in 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is negative, #maxNeeded will be 0 or negative, so the 'ExecutorAllocationManager' never requests containers from YARN and the app hangs.
> The last job hung when speculation is on
>
> Key: SPARK-18981
> URL: https://issues.apache.org/jira/browse/SPARK-18981
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.2
> Environment: spark 2.0.2
> hadoop 2.5.0
> Reporter: roncenzhao
> Priority: Critical
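The arithmetic in the reported failure can be made concrete. The self-contained model below mirrors the ceiling-division formula used for the executor target (the real logic lives in ExecutorAllocationManager; names here are illustrative):

```scala
// Model of maxNumExecutorsNeeded(): ceiling division of outstanding tasks
// by tasks-per-executor. Not the actual Spark source.
def maxNumExecutorsNeeded(numRunningOrPendingTasks: Int, tasksPerExecutor: Int): Int =
  (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor

// With the listener undercounting (condition 2), the task count goes negative:
val maxNeeded = maxNumExecutorsNeeded(numRunningOrPendingTasks = -1, tasksPerExecutor = 4)
// (-1 + 3) / 4 == 0, so zero executors are requested from YARN and
// job2's single task (condition 3) is never scheduled.
```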
[jira] [Commented] (SPARK-18924) Improve collect/createDataFrame performance in SparkR
[ https://issues.apache.org/jira/browse/SPARK-18924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771602#comment-15771602 ] Shivaram Venkataraman commented on SPARK-18924: --- Yeah I think we could convert this JIRA into a few sub-tasks - the first one could be profiling some of the existing code to get a breakdown of how much time is spent where. The next one could be the JVM side changes like boxing / unboxing improvements etc. > Improve collect/createDataFrame performance in SparkR > - > > Key: SPARK-18924 > URL: https://issues.apache.org/jira/browse/SPARK-18924 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Xiangrui Meng >Priority: Critical > > SparkR has its own SerDe for data serialization between JVM and R. > The SerDe on the JVM side is implemented in: > * > [SerDe.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/SerDe.scala] > * > [SQLUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala] > The SerDe on the R side is implemented in: > * > [deserialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/deserialize.R] > * > [serialize.R|https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R] > The serialization between JVM and R suffers from huge storage and computation > overhead. For example, a short round trip of 1 million doubles surprisingly > took 3 minutes on my laptop: > {code} > > system.time(collect(createDataFrame(data.frame(x=runif(100) >user system elapsed > 14.224 0.582 189.135 > {code} > Collecting a medium-sized DataFrame to local and continuing with a local R > workflow is a use case we should pay attention to. SparkR will never be able > to cover all existing features from CRAN packages. It is also unnecessary for > Spark to do so because not all features need scalability. > Several factors contribute to the serialization overhead: > 1. 
The SerDe on the R side is implemented using high-level R methods. > 2. DataFrame columns are not efficiently serialized, primitive-type columns > in particular. > 3. There is some overhead in the serialization protocol/implementation. > 1) might have been discussed before, because R packages like rJava existed before > SparkR. I'm not sure whether we have a license issue in depending on those > libraries. Another option is to switch to the low-level R C interface or Rcpp, > which again might have license issues. I'm not an expert here. If we have to > implement our own, there is still much room for improvement, as discussed > below. > 2) is a huge gap. The current collect is implemented by `SQLUtils.dfToCols`, > which collects rows to the driver and then constructs columns. However, > * it ignores column types, which results in boxing/unboxing overhead > * it collects all objects to the driver, which results in high GC pressure > A relatively simple change is to implement a specialized column builder based > on column types, primitive types in particular. We need to handle null/NA > values properly. A simple data structure we can use is > {code} > val size: Int > val nullIndexes: Array[Int] > val notNullValues: Array[T] // specialized for primitive types > {code} > On the R side, we can use `readBin` and `writeBin` to read the entire vector > in a single method call. The speed seems reasonable (on the order of GB/s): > {code} > > x <- runif(1e7) # 1e7, not 1e6 > > system.time(r <- writeBin(x, raw(0))) > user system elapsed > 0.036 0.021 0.059 > > > system.time(y <- readBin(r, double(), 1e7)) > user system elapsed > 0.015 0.007 0.024 > {code} > This is just a proposal that needs to be discussed and formalized. But in > general, it should be feasible to obtain a 20x or greater performance gain. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
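The (size, nullIndexes, notNullValues) layout proposed above could be sketched as follows. This is a hypothetical illustration in Java, not the actual SparkR implementation; the class and method names (DoubleColumnBuilder, append, notNullValues) are invented for this sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed null-tracking column layout:
// null cells are recorded by index, so the non-null values can live in an
// unboxed primitive array and be written out in one bulk call.
class DoubleColumnBuilder {
    private final List<Integer> nullIndexes = new ArrayList<>();
    private final List<Double> values = new ArrayList<>();
    private int size = 0;

    void append(Double v) {
        if (v == null) {
            nullIndexes.add(size);   // remember where the NA was
        } else {
            values.add(v);
        }
        size++;
    }

    int size() { return size; }

    int[] nullIndexes() {
        return nullIndexes.stream().mapToInt(Integer::intValue).toArray();
    }

    // Dense primitive array, suitable for a single readBin-style bulk read
    // on the R side.
    double[] notNullValues() {
        return values.stream().mapToDouble(Double::doubleValue).toArray();
    }
}
```

The receiving side would rebuild the column by reading the size, then the null index list, then the dense value array, avoiding per-cell boxing entirely.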
[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771583#comment-15771583 ] Lev commented on SPARK-18970: - Actually this is exactly the behavior I want. My problem is that the application appeared to be alive, but was not processing any new files after this message in the log. I intentionally included the portion of the log at the bottom, showing that the file list was refreshed every couple of minutes before. > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > Attachments: sparkerror.log > > > The Spark streaming application uses S3 files as streaming sources. After running > for several days, processing stopped even though the application continued to > run. > Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. Application should fail in case of such an error. > 2. Allow application to ignore such failure, since there is a chance that > during next refresh the error will not resurface. 
(In my case I believe the > error was caused by S3 cleaning the bucket at exactly the same moment the > refresh was running.) > My code to create the streaming processing looks like the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18805) InternalMapWithStateDStream make java.lang.StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-18805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771504#comment-15771504 ] Vladimir Pchelko commented on SPARK-18805: -- I have faced a similar problem ... there are two 'problems' with mapWithState: 1. spark.streaming.concurrentJobs 2. lack of memory with high GC time In both cases I noticed strange/magic errors. It seems that in your case the application is unrecoverable due to lack of memory. > InternalMapWithStateDStream make java.lang.StackOverflowError > -- > > Key: SPARK-18805 > URL: https://issues.apache.org/jira/browse/SPARK-18805 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.3, 2.0.2 > Environment: mesos >Reporter: etienne > > When loading InternalMapWithStateDStream from a checkpoint, if isValidTime is true and there is no generatedRDD at the given > time, there is an infinite loop: > 1) compute is called on InternalMapWithStateDStream > 2) InternalMapWithStateDStream tries to generate the previousRDD > 3) The stream looks in generatedRDD to see if the RDD is already generated for the given > time > 4) It does not find the RDD, so it checks whether the time is valid. 
> 5) If the time is valid, compute is called on InternalMapWithStateDStream again > 6) Restart from 1) > Here is the exception that illustrates this error: > {code} > Exception in thread "streaming-start" java.lang.StackOverflowError > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340) > at > org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415) > at > 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330) > at > org.apache.spark.streaming.dstream.InternalMapWithStateDStream.compute(MapWithStateDStream.scala:134) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
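The six-step cycle described in the report can be reduced to a small illustration. This is hypothetical Java, not the Spark implementation; CheckpointRecursionSketch and all member names are invented. The point is that a cache miss combined with an always-"valid" time turns getOrCompute and compute into unbounded mutual recursion, which is exactly the repeating frame pattern in the stack trace above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reduction of the reported cycle: after restoring from a
// checkpoint the cache of generated RDDs is empty, so getOrCompute falls
// through to compute, which in turn asks getOrCompute for the state it
// depends on, and the two recurse until the stack overflows.
class CheckpointRecursionSketch {
    private final Map<Long, String> generatedRDDs = new HashMap<>();

    String getOrCompute(long time) {
        String cached = generatedRDDs.get(time);
        if (cached != null) return cached;            // step 3: a cache hit would break the cycle
        if (isValidTime(time)) return compute(time);  // steps 4-5: time is valid, so compute again
        return null;
    }

    String compute(long time) {
        // step 2: generating this batch first needs the previous state RDD,
        // which maps back to the same missing cache entry
        return getOrCompute(time);
    }

    boolean isValidTime(long time) { return true; }   // always "valid" after restore
}
```

With an empty cache the recursion never terminates; either a populated generatedRDDs entry or a time-validity check that eventually returns false would break the loop.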
[jira] [Resolved] (SPARK-18972) Fix the netty thread names for RPC
[ https://issues.apache.org/jira/browse/SPARK-18972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18972. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 2.0.3 > Fix the netty thread names for RPC > -- > > Key: SPARK-18972 > URL: https://issues.apache.org/jira/browse/SPARK-18972 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Right now the names of the threads created by Netty for Spark RPC are > `shuffle-client-***` and `shuffle-server-***`. We should fix the names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-17807: --- Fix Version/s: 2.0.3 > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Priority: Trivial > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs
[ https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-18985. -- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > Add missing @InterfaceStability.Evolving for Structured Streaming APIs > -- > > Key: SPARK-18985 > URL: https://issues.apache.org/jira/browse/SPARK-18985 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17807) Scalatest listed as compile dependency in spark-tags
[ https://issues.apache.org/jira/browse/SPARK-17807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17807. Resolution: Fixed Fix Version/s: 2.1.1 > Scalatest listed as compile dependency in spark-tags > > > Key: SPARK-17807 > URL: https://issues.apache.org/jira/browse/SPARK-17807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Tom Standard >Priority: Trivial > Fix For: 2.1.1, 2.2.0 > > > In spark-tags:2.0.0, Scalatest is listed as a compile time dependency - > shouldn't this be in test scope? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18618) SparkR GLM model predict should support type as a argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771417#comment-15771417 ] Joseph K. Bradley commented on SPARK-18618: --- Note that [~yanboliang]'s PR from [SPARK-18291] has a lot of the code required to fix this issue. > SparkR GLM model predict should support type as a argument > -- > > Key: SPARK-18618 > URL: https://issues.apache.org/jira/browse/SPARK-18618 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > Labels: 2.2.0 > > SparkR GLM model {{predict}} should support {{type}} as an argument. This will > make it consistent with native R {{predict}}, such as > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18291) SparkR glm predict should output original label when family = "binomial"
[ https://issues.apache.org/jira/browse/SPARK-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-18291. - Resolution: Duplicate Target Version/s: (was: 2.2.0) I'm closing this since [SPARK-18618] will fix the issue. > SparkR glm predict should output original label when family = "binomial" > > > Key: SPARK-18291 > URL: https://issues.apache.org/jira/browse/SPARK-18291 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Attachments: SparkR2.1decisionoutputschemaforGLMs.pdf > > > SparkR spark.glm predict should output the original label when family = > "binomial". > For example, we can run the following code in the sparkr shell: > {code} > training <- suppressWarnings(createDataFrame(iris)) > training <- training[training$Species %in% c("versicolor", "virginica"), ] > model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = > binomial(link = "logit")) > showDF(predict(model, training)) > {code} > The prediction column is a double value, which makes no sense to users. 
> {code} > ++---++---+--+-+---+ > |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label| > prediction| > ++---++---+--+-+---+ > | 7.0|3.2| 4.7|1.4|versicolor| 0.0| > 0.8271421517601544| > | 6.4|3.2| 4.5|1.5|versicolor| 0.0| > 0.6044595910413112| > | 6.9|3.1| 4.9|1.5|versicolor| 0.0| > 0.7916340858281998| > | 5.5|2.3| 4.0|1.3|versicolor| > 0.0|0.16080518180591158| > | 6.5|2.8| 4.6|1.5|versicolor| 0.0| > 0.6112229217050189| > | 5.7|2.8| 4.5|1.3|versicolor| 0.0| > 0.2555087295500885| > | 6.3|3.3| 4.7|1.6|versicolor| 0.0| > 0.5681507664364834| > | 4.9|2.4| 3.3|1.0|versicolor| > 0.0|0.05990570219972002| > | 6.6|2.9| 4.6|1.3|versicolor| 0.0| > 0.6644434078306246| > | 5.2|2.7| 3.9|1.4|versicolor| > 0.0|0.11293577405862379| > | 5.0|2.0| 3.5|1.0|versicolor| > 0.0|0.06152372321585971| > | 5.9|3.0| 4.2|1.5|versicolor| > 0.0|0.35250697207602555| > | 6.0|2.2| 4.0|1.0|versicolor| > 0.0|0.32267018290814303| > | 6.1|2.9| 4.7|1.4|versicolor| 0.0| > 0.433391153814592| > | 5.6|2.9| 3.6|1.3|versicolor| 0.0| > 0.2280744262436993| > | 6.7|3.1| 4.4|1.4|versicolor| 0.0| > 0.7219848389339459| > | 5.6|3.0| 4.5|1.5|versicolor| > 0.0|0.23527698971404695| > | 5.8|2.7| 4.1|1.0|versicolor| 0.0| > 0.285024533520016| > | 6.2|2.2| 4.5|1.5|versicolor| 0.0| > 0.4107047877447493| > | 5.6|2.5| 3.9|1.1|versicolor| > 0.0|0.20083561961645083| > ++---++---+--+-+---+ > {code} > The prediction value should be the original label like: > {code} > ++---++---+--+-+--+ > |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| > Species|label|prediction| > ++---++---+--+-+--+ > | 7.0|3.2| 4.7|1.4|versicolor| 0.0| > virginica| > | 6.4|3.2| 4.5|1.5|versicolor| 0.0| > virginica| > | 6.9|3.1| 4.9|1.5|versicolor| 0.0| > virginica| > | 5.5|2.3| 4.0|1.3|versicolor| > 0.0|versicolor| > | 6.5|2.8| 4.6|1.5|versicolor| 0.0| > virginica| > | 5.7|2.8| 4.5|1.3|versicolor| > 0.0|versicolor| > | 6.3|3.3| 4.7|1.6|versicolor| 0.0| > virginica| > | 4.9|2.4| 3.3|1.0|versicolor| > 0.0|versicolor| > | 6.6|2.9| 
4.6|1.3|versicolor| 0.0| > virginica| > | 5.2|2.7| 3.9|1.4|versicolor| > 0.0|versicolor| > | 5.0|2.0| 3.5|1.0|versicolor| > 0.0|versicolor| > | 5.9|3.0| 4.2|1.5|versicolor| > 0.0|versicolor| > | 6.0|2.2| 4.0|1.0|versicolor| >
[jira] [Updated] (SPARK-18618) SparkR GLM model predict should support type as a argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18618: -- Description: SparkR GLM model {{predict}} should support {{type}} as a argument. This will it consistent with native R predict such as https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . (was: SparkR model {{predict}} should support {{type}} as a argument. This will it consistent with native R predict such as https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html .) > SparkR GLM model predict should support type as a argument > -- > > Key: SPARK-18618 > URL: https://issues.apache.org/jira/browse/SPARK-18618 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > Labels: 2.2.0 > > SparkR GLM model {{predict}} should support {{type}} as a argument. This will > it consistent with native R predict such as > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18618) SparkR GLM model predict should support type as a argument
[ https://issues.apache.org/jira/browse/SPARK-18618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18618: -- Summary: SparkR GLM model predict should support type as a argument (was: SparkR model predict should support type as a argument) > SparkR GLM model predict should support type as a argument > -- > > Key: SPARK-18618 > URL: https://issues.apache.org/jira/browse/SPARK-18618 > Project: Spark > Issue Type: Improvement > Components: ML, SparkR >Reporter: Yanbo Liang > Labels: 2.2.0 > > SparkR model {{predict}} should support {{type}} as a argument. This will it > consistent with native R predict such as > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.glm.html . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771364#comment-15771364 ] David Rosenstrauch commented on SPARK-9686: --- Ditto. Any closer to a fix? > Spark Thrift server doesn't return correct JDBC metadata > - > > Key: SPARK-9686 > URL: https://issues.apache.org/jira/browse/SPARK-9686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2 >Reporter: pin_zhang >Assignee: Cheng Lian >Priority: Critical > Attachments: SPARK-9686.1.patch.txt > > > 1. Start start-thriftserver.sh > 2. Connect with beeline > 3. Create a table > 4. show tables; the newly created table is returned > 5. > Class.forName("org.apache.hive.jdbc.HiveDriver"); > String URL = "jdbc:hive2://localhost:1/default"; > Properties info = new Properties(); > Connection conn = DriverManager.getConnection(URL, info); > ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), > null, null, null); > Problem: > No tables were returned by this API; this worked in Spark 1.3 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771304#comment-15771304 ] Shixiong Zhu commented on SPARK-18970: -- I see. FileStreamSource ignores FileNotFoundException when trying to get a file status inside the input directory. This allows the user to clean up the old processed files without failing the query. Any reason you want to disable this behavior? > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > Attachments: sparkerror.log > > > Spark streaming application uses S3 files as streaming sources. After running > for several day processing stopped even though an application continued to > run. > Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. Application should fail in case of such an error. > 2. Allow application to ignore such failure, since there is a chance that > during next refresh the error will not resurface. 
(In my case I believe an > error was cased by S3 cleaning the bucket exactly at the same moment when > refresh was running) > My code to create streaming processing looks as the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
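The ignore-and-continue behavior described in the comment above (skipping a FileNotFoundException raised while stat-ing one entry of the refreshed file list) can be sketched like this. This is hypothetical Java; ListingSketch and StatFn are invented names standing in for the FileStreamSource listing logic, and statFn.stat stands in for a FileSystem.getFileStatus call:

```java
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the described behavior: while refreshing the file
// list, an entry that vanished between the directory listing and the stat
// call (for example, deleted by an S3 lifecycle rule) is skipped instead of
// failing the whole streaming query.
class ListingSketch {
    interface StatFn {
        long stat(String path) throws FileNotFoundException;
    }

    static List<String> refresh(List<String> paths, StatFn statFn) {
        List<String> alive = new ArrayList<>();
        for (String p : paths) {
            try {
                statFn.stat(p);   // stands in for FileSystem.getFileStatus
                alive.add(p);
            } catch (FileNotFoundException e) {
                // file disappeared after listing: ignore and keep processing
            }
        }
        return alive;
    }
}
```

Under this design the query keeps running through routine file cleanup; the trade-off raised in the ticket is that a systematic listing failure then looks identical to ordinary cleanup and goes unnoticed.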
[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lev updated SPARK-18970: Attachment: sparkerror.log Here is the log file > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > Attachments: sparkerror.log > > > Spark streaming application uses S3 files as streaming sources. After running > for several day processing stopped even though an application continued to > run. > Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. Application should fail in case of such an error. > 2. Allow application to ignore such failure, since there is a chance that > during next refresh the error will not resurface. (In my case I believe an > error was cased by S3 cleaning the bucket exactly at the same moment when > refresh was running) > My code to create streaming processing looks as the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771221#comment-15771221 ] Shixiong Zhu commented on SPARK-18970: -- Where did you find the exception? If it's in the driver, could you post the full stack trace? 2.1 RC5 passed, so it should be available very soon. > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > > Spark streaming application uses S3 files as streaming sources. After running > for several day processing stopped even though an application continued to > run. > Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. Application should fail in case of such an error. > 2. Allow application to ignore such failure, since there is a chance that > during next refresh the error will not resurface. 
(In my case I believe an > error was caused by S3 cleaning the bucket exactly at the same moment when > refresh was running) > My code to create the streaming processing looks like the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
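The "allow application to ignore such failure" request in point 2 above maps onto a session option that later Spark releases expose; a minimal hedged sketch, assuming a Spark 2.1+ session where the `spark.sql.files.ignoreCorruptFiles` flag is available (verify the exact flag name against the configuration docs for your version):

```scala
// Hedged sketch, assuming Spark 2.1+: tell the SQL file sources to skip
// files that are corrupted or have disappeared (e.g. deleted from S3
// between the file-list refresh and the read) instead of failing the task.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
```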
[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771199#comment-15771199 ] Lev commented on SPARK-18970: - I am not sure whether the task was retried or not, but the Spark application never processed any new files after this error occurred. It appears that this error is never passed to the application code itself, so the application doesn't have a chance to terminate itself. Perhaps I should mention that I am running several Spark structured streams in my application. We'll wait for the 2.1 release to test SPARK-18774. I couldn't find any release date for 2.1. Any idea when it may happen? > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > > Spark streaming application uses S3 files as streaming sources. After running > for several days, processing stopped even though the application continued to > run.
> Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. Application should fail in case of such an error. > 2. Allow application to ignore such failure, since there is a chance that > during next refresh the error will not resurface. (In my case I believe an > error was caused by S3 cleaning the bucket exactly at the same moment when > refresh was running) > My code to create the streaming processing looks like the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17344) Kafka 0.8 support for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17344: - Target Version/s: (was: 2.1.1) > Kafka 0.8 support for Structured Streaming > -- > > Key: SPARK-17344 > URL: https://issues.apache.org/jira/browse/SPARK-17344 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Frederick Reiss > > Design and implement Kafka 0.8-based sources and sinks for Structured > Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs
[ https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771175#comment-15771175 ] Apache Spark commented on SPARK-18985: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16385 > Add missing @InterfaceStability.Evolving for Structured Streaming APIs > -- > > Key: SPARK-18985 > URL: https://issues.apache.org/jira/browse/SPARK-18985 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs
[ https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18985: Assignee: Shixiong Zhu (was: Apache Spark) > Add missing @InterfaceStability.Evolving for Structured Streaming APIs > -- > > Key: SPARK-18985 > URL: https://issues.apache.org/jira/browse/SPARK-18985 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771176#comment-15771176 ] Dongjoon Hyun commented on SPARK-18941: --- After investigating, I found that this was reported by [~andrewor14] and committed by [~yhuai] as an official issue, SPARK-15276, in 2.0.0. So, although this is not well documented outside the code, it is definitely intended behavior that was deliberately designed differently from Hive. Since the location is not under `spark-warehouse`, I agree with the current behavior. If there are no reasons other than Hive compatibility, it seems invalid to make a PR to fix this. Is it okay, [~luatnc]? cc [~smilegator] [~rxin] > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system > - > > Key: SPARK-18941 > URL: https://issues.apache.org/jira/browse/SPARK-18941 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2 >Reporter: luat > > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
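A minimal sketch of the behavior difference under discussion; the table name and HDFS path are illustrative, not from the report:

```scala
// Illustrative sketch only; table name and path are hypothetical.
// A managed (non-EXTERNAL) table created with an explicit LOCATION
// outside spark-warehouse:
spark.sql("CREATE TABLE t (id INT) LOCATION 'hdfs:///data/etl/t'")
// Spark 1.6.x (matching Hive) deleted the LOCATION directory here;
// per SPARK-15276, Spark 2.0.x intentionally leaves it in place.
spark.sql("DROP TABLE t")
```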
[jira] [Assigned] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs
[ https://issues.apache.org/jira/browse/SPARK-18985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18985: Assignee: Apache Spark (was: Shixiong Zhu) > Add missing @InterfaceStability.Evolving for Structured Streaming APIs > -- > > Key: SPARK-18985 > URL: https://issues.apache.org/jira/browse/SPARK-18985 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18985) Add missing @InterfaceStability.Evolving for Structured Streaming APIs
Shixiong Zhu created SPARK-18985: Summary: Add missing @InterfaceStability.Evolving for Structured Streaming APIs Key: SPARK-18985 URL: https://issues.apache.org/jira/browse/SPARK-18985 Project: Spark Issue Type: Improvement Components: Structured Streaming Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18537) Add a REST api to spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18537: --- Assignee: Xing Shi > Add a REST api to spark streaming > - > > Key: SPARK-18537 > URL: https://issues.apache.org/jira/browse/SPARK-18537 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Peter Chan >Assignee: Xing Shi > Fix For: 2.2.0 > > > We were trying to monitor our streaming application using the Spark REST interface > and found out that there is no API for streaming. > This left us no choice but to implement one ourselves. > This API should cover exactly the same amount of information as you can get > from the web interface. > The implementation is based on the current REST implementation of spark-core > and will be available for running applications only. > Here is how you can use it: > endpoint root: /streaming/api/v1 > || Endpoint || Meaning || > |/statistics|Statistics information of stream| > |/receivers|A list of all receiver streams| > |/receivers/\[stream-id\]|Details of the given receiver stream| > |/batches|A list of all retained batches| > |/batches/\[batch-id\]|Details of the given batch| > |/batches/\[batch-id\]/operations|A list of all output operations of the > given batch| > |/batches/\[batch-id\]/operations/\[operation-id\]|Details of the given > operation, given batch| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
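Based on the endpoint table above, a hedged usage sketch for polling the new API from Scala once the feature ships in 2.2.0; the host and port are assumptions ("driver-host:4040" is a placeholder, 4040 being only the default driver UI port):

```scala
import scala.io.Source

// Query the streaming REST API of a running application.
// Replace "driver-host:4040" with your driver's web UI address.
val root = "http://driver-host:4040/streaming/api/v1"
val stats   = Source.fromURL(s"$root/statistics").mkString // stream statistics
val batches = Source.fromURL(s"$root/batches").mkString    // retained batches
println(stats)
```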
[jira] [Commented] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15771166#comment-15771166 ] Shixiong Zhu commented on SPARK-18970: -- Did the Spark task fail or not? Looks like the Spark task was retried and it succeeded. As for ignoring such failures, that is done in SPARK-17850 and SPARK-18774. > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > > Spark streaming application uses S3 files as streaming sources. After running > for several days, processing stopped even though the application continued to > run. > Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. Application should fail in case of such an error. > 2. Allow application to ignore such failure, since there is a chance that > during next refresh the error will not resurface. 
(In my case I believe an > error was caused by S3 cleaning the bucket exactly at the same moment when > refresh was running) > My code to create the streaming processing looks like the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18537) Add a REST api to spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18537. Resolution: Fixed Fix Version/s: 2.2.0 > Add a REST api to spark streaming > - > > Key: SPARK-18537 > URL: https://issues.apache.org/jira/browse/SPARK-18537 > Project: Spark > Issue Type: New Feature > Components: Web UI >Reporter: Peter Chan > Fix For: 2.2.0 > > > We were trying to monitor our streaming application using the Spark REST interface > and found out that there is no API for streaming. > This left us no choice but to implement one ourselves. > This API should cover exactly the same amount of information as you can get > from the web interface. > The implementation is based on the current REST implementation of spark-core > and will be available for running applications only. > Here is how you can use it: > endpoint root: /streaming/api/v1 > || Endpoint || Meaning || > |/statistics|Statistics information of stream| > |/receivers|A list of all receiver streams| > |/receivers/\[stream-id\]|Details of the given receiver stream| > |/batches|A list of all retained batches| > |/batches/\[batch-id\]|Details of the given batch| > |/batches/\[batch-id\]/operations|A list of all output operations of the > given batch| > |/batches/\[batch-id\]/operations/\[operation-id\]|Details of the given > operation, given batch| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770991#comment-15770991 ] Ilya Matiach commented on SPARK-18054: -- It looks like I can still repro the error with this code: val data = sc.parallelize(Seq((1.0, org.apache.spark.mllib.linalg.Vectors.dense(Array(1.0, 2.0, 3.0))))).toDF("label", "features") val extractProbability = udf((vector: DenseVector) => vector(1)) val dfWithProbability = data.withColumn("foo", extractProbability(col("features"))) The error message is: org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(features)' due to data type mismatch: argument 1 requires vector type, however, '`features`' is of vector type.; > Unexpected error from UDF that gets an element of a vector: argument 1 > requires vector type, however, '`_column_`' is of vector type > > > Key: SPARK-18054 > URL: https://issues.apache.org/jira/browse/SPARK-18054 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.0.1 >Reporter: Barry Becker >Priority: Minor > > Not sure if this is a bug in ML or a more core part of spark. > It used to work in spark 1.6.2, but now gives me an error. > I have a pipeline that contains a NaiveBayesModel which I created like this > {code} > val nbModel = new NaiveBayes() > .setLabelCol(target) > .setFeaturesCol(FEATURES_COL) > .setPredictionCol(PREDICTION_COLUMN) > .setProbabilityCol("_probability_column_") > .setModelType("multinomial") > {code} > When I apply that pipeline to some data there will be a > "_probability_column_" of type vector. I want to extract a probability for a > specific class label using the following, but it no longer works.
> {code} > var newDf = pipeline.transform(df) > val extractProbability = udf((vector: DenseVector) => vector(1)) > val dfWithProbability = newDf.withColumn("foo", > extractProbability(col("_probability_column_"))) > {code} > The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. > I consider this a strange error because it's basically saying "argument 1 > requires a vector, but we got a vector instead". That does not make any sense > to me. It wants a vector, and a vector was given. Why does it fail? > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 > requires vector type, however, '`_class_probability_column__`' is of vector > type.; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at >
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at
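Since the repro above builds its column with the old `org.apache.spark.mllib.linalg` vectors while the Spark 2.x ML pipeline expects `org.apache.spark.ml.linalg` ones, one way to bridge the two is sketched below. This is a hedged sketch assuming `MLUtils.convertVectorColumnsToML` (present in Spark 2.0+) and the `data` DataFrame with its "features" column from the repro:

```scala
import org.apache.spark.mllib.util.MLUtils

// Convert the old-style mllib vector column ("features", from the repro
// above) to the new ml vector UDT so that UDFs over ml DenseVector resolve.
val converted = MLUtils.convertVectorColumnsToML(data, "features")
```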
[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770981#comment-15770981 ] Ilya Matiach commented on SPARK-18054: -- Actually, that error message above looks different. Maybe the model transformed the dataset into something that had the old-style vector UDT? > Unexpected error from UDF that gets an element of a vector: argument 1 > requires vector type, however, '`_column_`' is of vector type > > > Key: SPARK-18054 > URL: https://issues.apache.org/jira/browse/SPARK-18054 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.0.1 >Reporter: Barry Becker >Priority: Minor > > Not sure if this is a bug in ML or a more core part of spark. > It used to work in spark 1.6.2, but now gives me an error. > I have a pipeline that contains a NaiveBayesModel which I created like this > {code} > val nbModel = new NaiveBayes() > .setLabelCol(target) > .setFeaturesCol(FEATURES_COL) > .setPredictionCol(PREDICTION_COLUMN) > .setProbabilityCol("_probability_column_") > .setModelType("multinomial") > {code} > When I apply that pipeline to some data there will be a > "_probability_column_" of type vector. I want to extract a probability for a > specific class label using the following, but it no longer works. > {code} > var newDf = pipeline.transform(df) > val extractProbability = udf((vector: DenseVector) => vector(1)) > val dfWithProbability = newDf.withColumn("foo", > extractProbability(col("_probability_column_"))) > {code} > The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. > I consider this a strange error because it's basically saying "argument 1 > requires a vector, but we got a vector instead". That does not make any sense > to me. It wants a vector, and a vector was given. Why does it fail?
> {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 > requires vector type, however, '`_class_probability_column__`' is of vector > type.; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > 
at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210) > at >
[jira] [Resolved] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18054. --- Resolution: Not A Problem > Unexpected error from UDF that gets an element of a vector: argument 1 > requires vector type, however, '`_column_`' is of vector type > > > Key: SPARK-18054 > URL: https://issues.apache.org/jira/browse/SPARK-18054 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.0.1 >Reporter: Barry Becker >Priority: Minor > > Not sure if this is a bug in ML or a more core part of spark. > It used to work in spark 1.6.2, but now gives me an error. > I have a pipeline that contains a NaiveBayesModel which I created like this > {code} > val nbModel = new NaiveBayes() > .setLabelCol(target) > .setFeaturesCol(FEATURES_COL) > .setPredictionCol(PREDICTION_COLUMN) > .setProbabilityCol("_probability_column_") > .setModelType("multinomial") > {code} > When I apply that pipeline to some data there will be a > "_probability_column_" of type vector. I want to extract a probability for a > specific class label using the following, but it no longer works. > {code} > var newDf = pipeline.transform(df) > val extractProbability = udf((vector: DenseVector) => vector(1)) > val dfWithProbability = newDf.withColumn("foo", > extractProbability(col("_probability_column_"))) > {code} > The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. > I consider this a strange error because it's basically saying "argument 1 > requires a vector, but we got a vector instead". That does not make any sense > to me. It wants a vector, and a vector was given. Why does it fail?
> {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 > requires vector type, however, '`_class_probability_column__`' is of vector > type.; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > 
at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:210) >
[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770972#comment-15770972 ] Ilya Matiach commented on SPARK-18054: -- It looks like this is already fixed in the latest version. The error message I got from Spark was: java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce. My test code, in NaiveBayesSuite, was:
import org.apache.spark.sql.functions._
test("SPARK-18054: throw nice error message when using mllib vector instead of ml vector") {
  // Example taken from bug submitter
  // scalastyle:off println
  dataset.columns.foreach(println(_))
  // scalastyle:on println
  val nbModel = new NaiveBayes()
    .setLabelCol(dataset.columns(0))
    .setFeaturesCol(dataset.columns(1))
    .setPredictionCol("_prediction_column_")
    .setProbabilityCol("_probability_column_")
    .setModelType("multinomial")
    .fit(dataset)
  val data = sc.parallelize(Seq((1.0,
    org.apache.spark.mllib.linalg.Vectors.dense(Array.empty[Double]))))
    .toDF("label", "features")
  var newDf = nbModel.transform(data)
  val extractProbability = udf((vector: DenseVector) => vector(1))
  val dfWithProbability = newDf.withColumn("foo", extractProbability(col("_probability_column_")))
}
> Unexpected error from UDF that gets an element of a vector: argument 1 > requires vector type, however, '`_column_`' is of vector type > > > Key: SPARK-18054 > URL: https://issues.apache.org/jira/browse/SPARK-18054 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.0.1 >Reporter: Barry Becker >Priority: Minor > > Not sure if this is a bug in ML or a more core part of spark. > It used to work in spark 1.6.2, but now gives me an error. 
> I have a pipeline that contains a NaiveBayesModel which I created like this > {code} > val nbModel = new NaiveBayes() > .setLabelCol(target) > .setFeaturesCol(FEATURES_COL) > .setPredictionCol(PREDICTION_COLUMN) > .setProbabilityCol("_probability_column_") > .setModelType("multinomial") > {code} > When I apply that pipeline to some data there will be a > "_probability_column_" of type vector. I want to extract a probability for a > specific class label using the following, but it no longer works. > {code} > var newDf = pipeline.transform(df) > val extractProbability = udf((vector: DenseVector) => vector(1)) > val dfWithProbability = newDf.withColumn("foo", > extractProbability(col("_probability_column_"))) > {code} > The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. > I consider this a strange error because it's basically saying "argument 1 > requires a vector, but we got a vector instead". That does not make any sense > to me. It wants a vector, and a vector was given. Why does it fail? 
> {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 > requires vector type, however, '`_class_probability_column__`' is of vector > type.; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191) > at >
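A likely reading of the error quoted above, given the IllegalArgumentException in Ilya's reproduction, is an org.apache.spark.mllib versus org.apache.spark.ml vector mismatch: the two VectorUDTs render identically in the message, which is why it appears to complain that "a vector is not a vector". A hedged sketch of the working 2.x form, assuming a DataFrame `newDf` whose `_probability_column_` was produced by an org.apache.spark.ml pipeline:

```scala
// Hypothetical sketch (not from the ticket): assumes a DataFrame `newDf`
// produced by an org.apache.spark.ml pipeline, so `_probability_column_`
// holds ml (not mllib) vectors.
import org.apache.spark.ml.linalg.DenseVector   // note: ml, NOT mllib
import org.apache.spark.sql.functions.{col, udf}

// The UDF argument must be typed against the *ml* vector class. Binding it
// to org.apache.spark.mllib.linalg.DenseVector instead produces the
// confusing "requires vector type, however ... is of vector type" error,
// because the two VectorUDTs are distinct types that print the same way.
val extractProbability = udf((v: DenseVector) => v(1))
val dfWithProbability =
  newDf.withColumn("foo", extractProbability(col("_probability_column_")))
```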
[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770967#comment-15770967 ] Dongjoon Hyun commented on SPARK-18941: --- Hi. First of all, your case is correct. I can reproduce your example. Thanks. Currently, this seems to be intentional behavior, because Spark assumes your table is EXTERNAL when the user gives a location. ``` scala> sql("create table table_with_location(a int) stored as orc location '/tmp/table_with_location'") res0: org.apache.spark.sql.DataFrame = [] scala> sql("desc extended table_with_location").show(false) ... |# Detailed Table Information|CatalogTable( Table: `default`.`table_with_location` Owner: dhyun Created: Thu Dec 22 12:01:35 PST 2016 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: EXTERNAL Schema: [StructField(a,IntegerType,true)] Provider: hive Properties: [transient_lastDdlTime=1482436895] ... ``` Let me try to make a PR for this. > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system > - > > Key: SPARK-18941 > URL: https://issues.apache.org/jira/browse/SPARK-18941 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2 >Reporter: luat > > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
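Until the behavior is changed, one workaround consistent with the session above is to treat any `CREATE TABLE ... LOCATION` table as EXTERNAL and remove the directory manually after dropping it. A hedged sketch, assuming a SparkSession `spark` on Spark 2.0.x and that the table's location path is known:

```scala
// Hedged workaround sketch (not from the ticket): on Spark 2.0.x a
// `CREATE TABLE ... LOCATION` table is registered as EXTERNAL, so
// DROP TABLE leaves the directory behind. Drop the table, then delete
// the path explicitly through the Hadoop FileSystem API.
import org.apache.hadoop.fs.{FileSystem, Path}

val location = new Path("/tmp/table_with_location")  // assumed: the table's location
spark.sql("DROP TABLE IF EXISTS table_with_location")
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.delete(location, /* recursive = */ true)
```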
[jira] [Comment Edited] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file sys
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770967#comment-15770967 ] Dongjoon Hyun edited comment on SPARK-18941 at 12/22/16 8:05 PM: - Hi, first of all. your case is correct. I can reproduce your example. Thanks. Currently, it seems to be intentional behavior because Spark assumes your table is EXTERNAL when users give locations. {code} scala> sql("create table table_with_location(a int) stored as orc location '/tmp/table_with_location'") scala> sql("desc extended table_with_location").show(false) ... |# Detailed Table Information|CatalogTable( Table: `default`.`table_with_location` Owner: dhyun Created: Thu Dec 22 12:01:35 PST 2016 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: EXTERNAL Schema: [StructField(a,IntegerType,true)] Provider: hive Properties: [transient_lastDdlTime=1482436895] ... {code} Let me try to make a PR for this. was (Author: dongjoon): Hi, first of all. your case is correct. I can reproduce your example. Thanks. Currently, it seems to be intentional behavior because Spark assumes your table is EXTERNAL when users give locations. ``` scala> sql("create table table_with_location(a int) stored as orc location '/tmp/table_with_location'") res0: org.apache.spark.sql.DataFrame = [] scala> sql("desc extended table_with_location").show(false) ... |# Detailed Table Information|CatalogTable( Table: `default`.`table_with_location` Owner: dhyun Created: Thu Dec 22 12:01:35 PST 2016 Last Access: Wed Dec 31 16:00:00 PST 1969 Type: EXTERNAL Schema: [StructField(a,IntegerType,true)] Provider: hive Properties: [transient_lastDdlTime=1482436895] ... ``` Let me try to make a PR for this. 
> Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system > - > > Key: SPARK-18941 > URL: https://issues.apache.org/jira/browse/SPARK-18941 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2 >Reporter: luat > > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18984) Concat with ds.write.text() throw exception if column contains null data
Tony Fraser created SPARK-18984: --- Summary: Concat with ds.write.text() throw exception if column contains null data Key: SPARK-18984 URL: https://issues.apache.org/jira/browse/SPARK-18984 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Environment: spark2.02 scala 2.11.8 Reporter: Tony Fraser
val customOutputFormat = outbound.select(concat(
    outbound.col("device_id"), lit("\t"),
    lit("\"device\"=\""), col("device"), lit("\","),
    lit("\"_fd_cast\"=\""), col("_fd_cast"), lit("\",")
  )).where("device_type='ios_app' and _fd_cast is not null")
customOutputFormat
  .limit(1000)
  .write
  .option("nullValue", "NULL")
  .mode("overwrite")
  .text("/filepath")
There is no problem writing to JSON, CSV, or Parquet, and the code above works. As soon as you take out "and _fd_cast is not null", though, it throws the exception below. Using either nullValue or treatEmptyValuesAsNulls, whether reading in or writing out, doesn't seem to matter. Exception is: 16/12/22 14:16:18 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms 16/12/22 14:16:18 INFO DefaultWriterContainer: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter 16/12/22 14:16:18 ERROR Utils: Aborting task java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/12/22 14:16:18 ERROR DefaultWriterContainer: Task attempt attempt_201612221416_0002_m_00_0 aborted. 16/12/22 14:16:18 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) org.apache.spark.SparkException: Task failed while writing rows at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.NullPointerException at org.apache.spark.sql.execution.datasources.text.TextOutputWriter.writeInternal(TextFileFormat.scala:146) at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:255) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) ... 8 more 16/12/22 14:16:18
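A possible workaround for the NullPointerException above, under the assumption that `concat` returns null when any of its inputs is null and that the 2.0.2 text writer cannot handle a null row: wrap each nullable column in `coalesce` so the concatenated value is never null, which also removes the need for the `_fd_cast is not null` filter. A hedged sketch reusing the column names from the report (the `outbound` DataFrame and the "NULL" placeholder string are assumptions):

```scala
// Hedged workaround sketch: coalesce each nullable input of concat() so
// the resulting column is never null, preventing the NPE in
// TextOutputWriter when writing with ds.write.text().
import org.apache.spark.sql.functions.{coalesce, col, concat, lit}

val customOutputFormat = outbound
  .where("device_type='ios_app'")        // null _fd_cast rows may now stay
  .select(concat(
    col("device_id"), lit("\t"),
    lit("\"device\"=\""), coalesce(col("device"), lit("NULL")), lit("\","),
    lit("\"_fd_cast\"=\""), coalesce(col("_fd_cast"), lit("NULL")), lit("\",")
  ))
```

Filtering before the `select` (rather than after, as in the report) keeps `device_type` resolvable without relying on the analyzer restoring dropped columns.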
[jira] [Resolved] (SPARK-18975) Add an API to remove SparkListener from SparkContext
[ https://issues.apache.org/jira/browse/SPARK-18975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18975. - Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.2.0 > Add an API to remove SparkListener from SparkContext > - > > Key: SPARK-18975 > URL: https://issues.apache.org/jira/browse/SPARK-18975 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.2.0 > > > In current Spark we can add a customized {{SparkListener}} through the > {{SparkContext#addListener}} API, but there is no API to remove a registered > one. In our scenario, SparkListeners are added repeatedly according to the > changing environment. Lacking the ability to remove listeners, a bunch of > registered listeners can accumulate over time; this is unnecessary and > potentially affects performance. So here we propose adding an API to remove a > registered listener. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
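With the fix in 2.2.0, registration and removal pair up roughly as follows. This is a hedged sketch: the method name comes from the ticket's fix, while the listener body and surrounding usage are placeholders.

```scala
// Hedged sketch of the SPARK-18975 API (Spark 2.2.0+): a listener
// registered for one job can now be detached afterwards instead of
// accumulating on the SparkContext. Assumes a live SparkContext `sc`.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}

val listener = new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"job ${jobEnd.jobId} finished")
}

sc.addSparkListener(listener)
// ... run the job that needs the listener ...
sc.removeSparkListener(listener)   // new in 2.2.0 (SPARK-18975)
```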
[jira] [Resolved] (SPARK-18983) Couldn't find leader offsets exception when the one of kafka cluster brokers is down
[ https://issues.apache.org/jira/browse/SPARK-18983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18983. --- Resolution: Invalid Target Version/s: (was: 1.6.1) Please read http://spark.apache.org/contributing.html This should go to u...@spark.apache.org But if all your brokers are down, this is the behavior you expect. > Couldn't find leader offsets exception when the one of kafka cluster brokers > is down > > > Key: SPARK-18983 > URL: https://issues.apache.org/jira/browse/SPARK-18983 > Project: Spark > Issue Type: Bug >Reporter: kraken >Priority: Critical > > Hello,i got a trouble today > when the company's PE restarts the one of kafka cluster broker,my spark job > run failed. > Exception in thread "main" org.apache.spark.SparkException: get earliest > leader offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't > find leaders for Set > I use the low level type to consume kafka log, i know that high level can be > aware of the lead changes, but how can i do that by use low level? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18982) Couldn't find leader offsets exception when the one of kafka cluster brokers is down
[ https://issues.apache.org/jira/browse/SPARK-18982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18982. --- Resolution: Duplicate Target Version/s: (was: 1.6.1) > Couldn't find leader offsets exception when the one of kafka cluster brokers > is down > > > Key: SPARK-18982 > URL: https://issues.apache.org/jira/browse/SPARK-18982 > Project: Spark > Issue Type: Bug >Reporter: kraken >Priority: Critical > > Hello,i got a trouble today > when the company's PE restarts the one of kafka cluster broker,my spark job > run failed. > Exception in thread "main" org.apache.spark.SparkException: get earliest > leader offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't > find leaders for Set > I use the low level type to consume kafka log, i know that high level can be > aware of the lead changes, but how can i do that by use low level? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18941) Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the directory associated with the Hive table (not EXTERNAL table) from the HDFS file system
[ https://issues.apache.org/jira/browse/SPARK-18941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770823#comment-15770823 ] Dongjoon Hyun commented on SPARK-18941: --- Thank you for the detail! I'll try that. > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system > - > > Key: SPARK-18941 > URL: https://issues.apache.org/jira/browse/SPARK-18941 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.2 >Reporter: luat > > Spark thrift server, Spark 2.0.2, The "drop table" command doesn't delete the > directory associated with the Hive table (not EXTERNAL table) from the HDFS > file system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18983) Couldn't find leader offsets exception when the one of kafka cluster brokers is down
kraken created SPARK-18983: -- Summary: Couldn't find leader offsets exception when the one of kafka cluster brokers is down Key: SPARK-18983 URL: https://issues.apache.org/jira/browse/SPARK-18983 Project: Spark Issue Type: Bug Reporter: kraken Priority: Critical Hello, I ran into a problem today: when the company's PE restarted one of the Kafka cluster brokers, my Spark job failed. Exception in thread "main" org.apache.spark.SparkException: get earliest leader offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't find leaders for Set I use the low-level API to consume the Kafka log. I know the high-level consumer can be aware of leader changes, but how can I do that with the low-level API? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18982) Couldn't find leader offsets exception when the one of kafka cluster brokers is down
kraken created SPARK-18982: -- Summary: Couldn't find leader offsets exception when the one of kafka cluster brokers is down Key: SPARK-18982 URL: https://issues.apache.org/jira/browse/SPARK-18982 Project: Spark Issue Type: Bug Reporter: kraken Priority: Critical Hello, I ran into a problem today: when the company's PE restarted one of the Kafka cluster brokers, my Spark job failed. Exception in thread "main" org.apache.spark.SparkException: get earliest leader offsets failed: ArrayBuffer(org.apache.spark.SparkException: Couldn't find leaders for Set I use the low-level API to consume the Kafka log. I know the high-level consumer can be aware of leader changes, but how can I do that with the low-level API? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18970) FileSource failure during file list refresh doesn't cause an application to fail, but stops further processing
[ https://issues.apache.org/jira/browse/SPARK-18970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lev updated SPARK-18970: Affects Version/s: 2.0.2 > FileSource failure during file list refresh doesn't cause an application to > fail, but stops further processing > -- > > Key: SPARK-18970 > URL: https://issues.apache.org/jira/browse/SPARK-18970 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming >Affects Versions: 2.0.0, 2.0.2 >Reporter: Lev > > A Spark streaming application uses S3 files as streaming sources. After running > for several days, processing stopped even though the application continued to > run. > Stack trace: > java.io.FileNotFoundException: No such file or directory > 's3n://X' > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:818) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:511) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:465) > at > org.apache.spark.sql.execution.datasources.HadoopFsRelation$$anonfun$7$$anonfun$apply$3.apply(fileSourceInterfaces.scala:462) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310) > at scala.collection.AbstractIterator.to(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302) > at 
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:893) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1897) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > I believe 2 things should (or can) be fixed: > 1. The application should fail in case of such an error. > 2. Allow the application to ignore such a failure, since there is a chance that > during the next refresh the error will not resurface. (In my case I believe the > error was caused by S3 cleaning the bucket at exactly the same moment the > refresh was running.) > My code to create the streaming processing looks like the following: > val cq = sqlContext.readStream > .format("json") > .schema(struct) > .load(s"input") > .writeStream > .option("checkpointLocation", s"checkpoints") > .foreach(new ForeachWriter[Row] {...}) > .trigger(ProcessingTime("10 seconds")).start() > > cq.awaitTermination() -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18973) Remove SortPartitions and RedistributeData
[ https://issues.apache.org/jira/browse/SPARK-18973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-18973. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 > Remove SortPartitions and RedistributeData > -- > > Key: SPARK-18973 > URL: https://issues.apache.org/jira/browse/SPARK-18973 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.1, 2.2.0 > > > SortPartitions and RedistributeData logical operators are not actually used > and can be removed. Note that we do have a Sort operator (with global flag > false) that subsumed SortPartitions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality
[ https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18031: - Fix Version/s: 2.0.3 > Flaky test: > org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic > functionality > --- > > Key: SPARK-18031 > URL: https://issues.apache.org/jira/browse/SPARK-18031 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Shixiong Zhu > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality
[ https://issues.apache.org/jira/browse/SPARK-18031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18031: - Fix Version/s: 2.2.0 > Flaky test: > org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic > functionality > --- > > Key: SPARK-18031 > URL: https://issues.apache.org/jira/browse/SPARK-18031 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite_name=basic+functionality -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18054: -- Priority: Minor (was: Major) Component/s: Documentation Issue Type: Improvement (was: Bug) > Unexpected error from UDF that gets an element of a vector: argument 1 > requires vector type, however, '`_column_`' is of vector type > > > Key: SPARK-18054 > URL: https://issues.apache.org/jira/browse/SPARK-18054 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Affects Versions: 2.0.1 >Reporter: Barry Becker >Priority: Minor > > Not sure if this is a bug in ML or a more core part of spark. > It used to work in spark 1.6.2, but now gives me an error. > I have a pipeline that contains a NaiveBayesModel which I created like this > {code} > val nbModel = new NaiveBayes() > .setLabelCol(target) > .setFeaturesCol(FEATURES_COL) > .setPredictionCol(PREDICTION_COLUMN) > .setProbabilityCol("_probability_column_") > .setModelType("multinomial") > {code} > When I apply that pipeline to some data there will be a > "_probability_column_" of type vector. I want to extract a probability for a > specific class label using the following, but it no longer works. > {code} > var newDf = pipeline.transform(df) > val extractProbability = udf((vector: DenseVector) => vector(1)) > val dfWithProbability = newDf.withColumn("foo", > extractProbability(col("_probability_column_"))) > {code} > The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. > I consider this a strange error because it's basically saying "argument 1 > requires a vector, but we got a vector instead". That does not make any sense > to me. It wants a vector, and a vector was given. Why does it fail? 
[jira] [Commented] (SPARK-18054) Unexpected error from UDF that gets an element of a vector: argument 1 requires vector type, however, '`_column_`' is of vector type
[ https://issues.apache.org/jira/browse/SPARK-18054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770690#comment-15770690 ] Ilya Matiach commented on SPARK-18054: -- I can try to repro this and add a better error message. > Unexpected error from UDF that gets an element of a vector: argument 1 > requires vector type, however, '`_column_`' is of vector type > > > Key: SPARK-18054 > URL: https://issues.apache.org/jira/browse/SPARK-18054 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Barry Becker > > Not sure if this is a bug in ML or a more core part of Spark. > It used to work in Spark 1.6.2, but now gives me an error. > I have a pipeline that contains a NaiveBayesModel which I created like this: > {code} > val nbModel = new NaiveBayes() > .setLabelCol(target) > .setFeaturesCol(FEATURES_COL) > .setPredictionCol(PREDICTION_COLUMN) > .setProbabilityCol("_probability_column_") > .setModelType("multinomial") > {code} > When I apply that pipeline to some data there will be a > "_probability_column_" of type vector. I want to extract a probability for a > specific class label using the following, but it no longer works. > {code} > var newDf = pipeline.transform(df) > val extractProbability = udf((vector: DenseVector) => vector(1)) > val dfWithProbability = newDf.withColumn("foo", > extractProbability(col("_probability_column_"))) > {code} > The error I get now that I have upgraded to 2.0.1 from 1.6.2 is shown below. > I consider this a strange error because it's basically saying "argument 1 > requires a vector, but we got a vector instead". That does not make any sense > to me. It wants a vector, and a vector was given. Why does it fail? 
> {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'UDF(_class_probability_column__)' due to data type mismatch: argument 1 > requires vector type, however, '`_class_probability_column__`' is of vector > type.; > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:82) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:191) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:201) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:205) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > 
at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:205) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$5.apply(QueryPlan.scala:210) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at >
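A plausible cause (an assumption, not confirmed in this thread): Spark 2.0 introduced `org.apache.spark.ml.linalg` vectors alongside the legacy `org.apache.spark.mllib.linalg` ones, and both user-defined types display simply as "vector" in analysis errors, so a UDF typed against the wrong `DenseVector` fails with a message that names the same type twice. A minimal pure-Python sketch of how two distinct types with the same display name produce this self-contradictory message (the class and function names here are hypothetical, not Spark source):

```python
# Two distinct stand-in types that happen to share a display name,
# mirroring the ml vs. mllib VectorUDT collision (an assumption).
class MlVectorType:
    simple_string = "vector"


class MllibVectorType:
    simple_string = "vector"


def check_udf_argument(expected, actual):
    """Return an analysis-style error when the concrete types differ,
    even though their display names are identical."""
    if type(expected) is not type(actual):
        return (f"argument 1 requires {expected.simple_string} type, "
                f"however, input is of {actual.simple_string} type.")
    return None


msg = check_udf_argument(MllibVectorType(), MlVectorType())
# The message names "vector" twice although the underlying types differ,
# which is exactly why the reported error reads as nonsensical.
print(msg)
```

A clearer error message would print the fully qualified type names rather than the shared short name.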
[jira] [Updated] (SPARK-18738) Some Spark SQL queries have poor performance on HDFS Erasure Coding feature when enabling dynamic allocation.
[ https://issues.apache.org/jira/browse/SPARK-18738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18738: -- Fix Version/s: (was: 2.2.0) > Some Spark SQL queries have poor performance on HDFS Erasure Coding feature > when enabling dynamic allocation. > > > Key: SPARK-18738 > URL: https://issues.apache.org/jira/browse/SPARK-18738 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Lifeng Wang > > We run TPCx-BB with the Spark SQL engine on a local cluster using Spark 2.0.3 trunk > and Hadoop 3.0 alpha 2 trunk. We run Spark SQL queries with the same data size on > both Erasure Coding and 3-replication. The test results show that some > queries have much worse performance on EC compared to 3-replication. After > initial investigations, we found Spark starts one third as many executors to execute > queries on EC as on 3-replication. > Take query 30 as an example: our cluster can launch 108 executors in total. > When we run the query from the 3-replication database, Spark will start all 108 > executors to execute the query. When we run the query from the Erasure Coding > database, Spark will launch 108 executors and kill 72 executors because > they are idle, so in the end only 36 executors execute the query, which > leads to poor performance. > This issue only happens when we enable the dynamic allocation mechanism. When we > disable dynamic allocation, Spark SQL queries on EC have similar > performance to 3-replication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18234) Update mode in structured streaming
[ https://issues.apache.org/jira/browse/SPARK-18234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18234: -- Assignee: Tathagata Das > Update mode in structured streaming > --- > > Key: SPARK-18234 > URL: https://issues.apache.org/jira/browse/SPARK-18234 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > > We have this internally, but we should nail down the semantics and expose it to > users. The idea of update mode is that any tuple that changes will be > emitted. Open questions: > - do we need to reason about the {{keys}} for a given stream? For things > like the {{foreach}} sink it's up to the user. However, for more end-to-end > use cases such as a JDBC sink, we need to know which row downstream is being > updated. > - is it okay not to support files? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18281) toLocalIterator yields timeout error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18281: -- Assignee: Liang-Chi Hsieh > toLocalIterator yields timeout error on pyspark2 > - > > Key: SPARK-18281 > URL: https://issues.apache.org/jira/browse/SPARK-18281 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04.5 LTS > Driver: AWS M4.XLARGE > Slaves: AWS M4.4.XLARGE > mesos 1.0.1 > spark 2.0.1 > pyspark >Reporter: Luke Miner >Assignee: Liang-Chi Hsieh > Fix For: 2.0.3, 2.1.1 > > > I run the example straight out of the API docs for toLocalIterator and it > gives a timeout exception: > {code} > from pyspark import SparkContext > sc = SparkContext() > rdd = sc.parallelize(range(10)) > [x for x in rdd.toLocalIterator()] > {code} > conf file: > spark.driver.maxResultSize 6G > spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G > -XX:+HeapDumpOnOutOfMemoryError > spark.executor.memory 16G > spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem > spark.hadoop.fs.s3a.buffer.dir /raid0/spark > spark.hadoop.fs.s3n.buffer.dir /raid0/spark > spark.hadoop.fs.s3a.connection.timeout 50 > spark.hadoop.fs.s3n.multipart.uploads.enabled true > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 > spark.hadoop.parquet.block.size 2147483648 > spark.hadoop.parquet.enable.summary-metadata false > spark.jars.packages > com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 > spark.local.dir /raid0/spark > spark.mesos.coarse false > spark.mesos.constraints priority:1 > spark.network.timeout 600 > spark.rpc.message.maxSize 500 > spark.speculation false > spark.sql.parquet.mergeSchema false > spark.sql.planner.externalSort true > spark.submit.deployMode client > spark.task.cpus 1 > Exception here: > {code} > --- > timeout Traceback (most recent call last) > in () > 2 sc = 
SparkContext() > 3 rdd = sc.parallelize(range(10)) > > 4 [x for x in rdd.toLocalIterator()] > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in > _load_from_socket(port, serializer) > 140 try: > 141 rf = sock.makefile("rb", 65536) > --> 142 for item in serializer.load_stream(rf): > 143 yield item > 144 finally: > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > load_stream(self, stream) > 137 while True: > 138 try: > --> 139 yield self._read_with_length(stream) > 140 except EOFError: > 141 return > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > _read_with_length(self, stream) > 154 > 155 def _read_with_length(self, stream): > --> 156 length = read_int(stream) > 157 if length == SpecialLengths.END_OF_DATA_SECTION: > 158 raise EOFError > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > read_int(stream) > 541 > 542 def read_int(stream): > --> 543 length = stream.read(4) > 544 if not length: > 545 raise EOFError > /usr/lib/python2.7/socket.pyc in read(self, size) > 378 # fragmentation issues on many platforms. > 379 try: > --> 380 data = self._sock.recv(left) > 381 except error, e: > 382 if e.args[0] == EINTR: > timeout: timed out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17801. --- Resolution: Not A Problem I think this is just attributable to extremely high maxBins, and not a bug. > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > "-XX:+PrintGCTimeStamps" 
"-XX:+PrintTenuringDistribution" > "-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17801) [ML]Random Forest Regression fails for large input
[ https://issues.apache.org/jira/browse/SPARK-17801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770463#comment-15770463 ] Ilya Matiach commented on SPARK-17801: -- Taking a look into the error > [ML]Random Forest Regression fails for large input > -- > > Key: SPARK-17801 > URL: https://issues.apache.org/jira/browse/SPARK-17801 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: Ubuntu 14.04 >Reporter: samkit >Priority: Minor > > Random Forest Regression > Data:https://www.kaggle.com/c/grupo-bimbo-inventory-demand/download/train.csv.zip > Parameters: > NumTrees:500Maximum Bins:7477383 MaxDepth:27 > MinInstancesPerNode:8648 SamplingRate:1.0 > Java Options: > "-Xms16384M" "-Xmx16384M" "-Dspark.locality.wait=0s" > "-Dspark.driver.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+UseConcMarkSweepGC > -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:-UseAdaptiveSizePolicy > -XX:ConcGCThreads=2 -XX:-UseGCOverheadLimit > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=8g -XX:MaxNewSize=8g > -XX:SurvivorRatio=3 -DnumPartitions=36" "-Dspark.submit.deployMode=cluster" > "-Dspark.speculation=true" " "-Dspark.speculation.multiplier=2" > "-Dspark.driver.memory=16g" "-Dspark.speculation.interval=300ms" > "-Dspark.speculation.quantile=0.5" "-Dspark.akka.frameSize=768" > "-Dspark.driver.supervise=false" "-Dspark.executor.cores=6" > "-Dspark.executor.extraJavaOptions=-Xss10240k -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution > -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:+UseParallelOldGC > -XX:ParallelGCThreads=6 -XX:NewSize=22g -XX:MaxNewSize=22g > -XX:SurvivorRatio=2 -XX:+PrintAdaptiveSizePolicy -XX:+PrintGCDateStamps" > "-Dspark.rpc.askTimeout=10" "-Dspark.executor.memory=40g" > "-Dspark.driver.maxResultSize=3g" "-Xss10240k" "-XX:+PrintGCDetails" > "-XX:+PrintGCTimeStamps" "-XX:+PrintTenuringDistribution" > 
"-XX:+UseConcMarkSweepGC" "-XX:+UseParNewGC" "-XX:ParallelGCThreads=2" > "-XX:-UseAdaptiveSizePolicy" "-XX:ConcGCThreads=2" "-XX:-UseGCOverheadLimit" > "-XX:CMSInitiatingOccupancyFraction=75" "-XX:NewSize=8g" "-XX:MaxNewSize=8g" > "-XX:SurvivorRatio=3" "-DnumPartitions=36" > Partial Driver StackTrace: > org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:740) > > org.apache.spark.ml.tree.impl.RandomForest$.findBestSplits(RandomForest.scala:525) > org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:160) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:209) > > org.apache.spark.ml.regression.CustomRandomForestRegressor.train(CustomRandomForestRegressor.scala:197) > org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > org.apache.spark.ml.Estimator.fit(Estimator.scala:59) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:78) > For complete Executor and Driver ErrorLog > https://gist.github.com/anonymous/603ac7f8f17e43c51ba93b2934cd4cb6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770446#comment-15770446 ] Ilya Matiach commented on SPARK-17975: -- Could you send a link to the repro dataset? I could work on this issue, but it looks like you have a fix already. For any fixes we need tests to validate them. > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > > I'm able to reproduce the error consistently with a 2000-record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about its type in a way that causes > the RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. 
> {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Resolved] (SPARK-18922) Fix more resource-closing-related and path-related test failures in identified ones on Windows
[ https://issues.apache.org/jira/browse/SPARK-18922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18922. --- Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 2.2.0 If we find more like this in the short term, let's just reopen rather than make more JIRAs. Resolved by https://github.com/apache/spark/pull/16335 > Fix more resource-closing-related and path-related test failures in > identified ones on Windows > -- > > Key: SPARK-18922 > URL: https://issues.apache.org/jira/browse/SPARK-18922 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.0 > > > There are more tests that fail on Windows, as below: > - {{LauncherBackendSuite}}: > {code} > - local: launcher handle *** FAILED *** (30 seconds, 120 milliseconds) > The code passed to eventually never returned normally. Attempted 283 times > over 30.0960053 seconds. Last failure message: The reference was null. > (LauncherBackendSuite.scala:56) > org.scalatest.exceptions.TestFailedDueToTimeoutException: > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) > - standalone/client: launcher handle *** FAILED *** (30 seconds, 47 > milliseconds) > The code passed to eventually never returned normally. Attempted 282 times > over 30.03798710002 seconds. Last failure message: The reference was > null. 
(LauncherBackendSuite.scala:56) > org.scalatest.exceptions.TestFailedDueToTimeoutException: > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) > {code} > - {{SQLQuerySuite}}: > {code} > - specifying database name for a temporary table is not allowed *** FAILED > *** (125 milliseconds) > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/C:projectsspark arget mpspark-1f4471ab-aac0-4239-ae35-833d54b37e52; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370) > {code} > - {{JsonSuite}}: > {code} > - Loading a JSON dataset from a text file with SQL *** FAILED *** (94 > milliseconds) > org.apache.spark.sql.AnalysisException: Path does not exist: > file:/C:projectsspark arget mpspark-c918a8b7-fc09-433c-b9d0-36c0f78ae918; > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370) > {code} > - {{StateStoreSuite}}: > {code} > - SPARK-18342: commit fails when rename fails *** FAILED *** (16 milliseconds) > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: > StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0 > at org.apache.hadoop.fs.Path.initialize(Path.java:206) > at org.apache.hadoop.fs.Path.(Path.java:116) > at org.apache.hadoop.fs.Path.(Path.java:89) > ... 
> Cause: java.net.URISyntaxException: Relative path in absolute URI: > StateStoreSuite29777261fs://C:%5Cprojects%5Cspark%5Ctarget%5Ctmp%5Cspark-ef349862-7281-4963-aaf3-add0d670a4ad%5C?-2218c2f8-2cf6-4f80-9cdf-96354e8246a77685899733421033312/0 > at java.net.URI.checkPath(URI.java:1823) > at java.net.URI.(URI.java:745) > at org.apache.hadoop.fs.Path.initialize(Path.java:203) > {code} > - {{HDFSMetadataLogSuite}}: > {code} > - FileManager: FileContextManager *** FAILED *** (94 milliseconds) > java.io.IOException: Failed to delete: > C:\projects\spark\target\tmp\spark-415bb0bd-396b-444d-be82-04599e025f21 > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38) > - FileManager: FileSystemManager *** FAILED *** (78 milliseconds) > java.io.IOException: Failed to delete: > C:\projects\spark\target\tmp\spark-ef8222cd-85aa-47c0-a396-bc7979e15088 > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010) > at > org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:127) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.withTempDir(HDFSMetadataLogSuite.scala:38) > {code} >
[jira] [Resolved] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows
[ https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18878. --- Resolution: Done Let's look at reopening recent JIRAs and adding PRs if you find more changes of the same type as in previous changes. > Fix/investigate the more identified test failures in Java/Scala on Windows > -- > > Key: SPARK-18878 > URL: https://issues.apache.org/jira/browse/SPARK-18878 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Hyukjin Kwon > > It seems many tests are failing on Windows. Some are only related to the > tests themselves, whereas others are related to the functionality itself, which > causes actual failures for some APIs on Windows. > The tests were hanging due to some issues in SPARK-17591 and SPARK-18785; > now we can apparently proceed much further (it seems we might reach the end). > The tests proceeded via AppVeyor - > https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18981) The last job hung when speculation is on
[ https://issues.apache.org/jira/browse/SPARK-18981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] roncenzhao updated SPARK-18981: --- Description: related settings: spark.speculation true spark.dynamicAllocation.minExecutors 0 spark.executor.cores 4 When I run the following app, the bug will trigger. ``` sc.runJob(job1) sleep(100s) sc.runJob(job2) // the job2 will hang and never be scheduled ``` The triggering condition is described as follows: condition1: During the sleeping time, the executors will be released and the number of executors will be zero a few seconds later. The #numExecutorsTarget in 'ExecutorAllocationManager' will be 0. condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks will go negative as job1's tasks end. condition3: Job2 has only one task. result: In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or negative. So the 'ExecutorAllocationManager' will not request containers from YARN. The app will hang. was: related settings: spark.speculation true spark.dynamicAllocation.minExecutors 0 spark.executor.cores 4 When I run the following app, the bug will trigger. ``` sc.runJob(job1) sleep(100s) sc.runJob(job2) // the job2 will hang and never be scheduled ``` The triggering condition is described as follows: condition1: During the sleeping time, the executors will be released and the number of executors will be zero a few seconds later. The #numExecutorsTarget in 'ExecutorAllocationManager' will be 0. condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks will go negative as job1's tasks end. condition3: Job2 has only one task. result: In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. 
Obviously, #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or negative. So the 'ExecutorAllocationManager' will not request containers from YARN. The app will hang. > The last job hung when speculation is on > > > Key: SPARK-18981 > URL: https://issues.apache.org/jira/browse/SPARK-18981 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.2 > Environment: spark2.0.2 > hadoop2.5.0 >Reporter: roncenzhao >Priority: Critical > > related settings: > spark.speculation true > spark.dynamicAllocation.minExecutors 0 > spark.executor.cores 4 > When I run the following app, the bug will trigger. > ``` > sc.runJob(job1) > sleep(100s) > sc.runJob(job2) // the job2 will hang and never be scheduled > ``` > The triggering condition is described as follows: > condition1: During the sleeping time, the executors will be released and the > number of executors will be zero a few seconds later. The #numExecutorsTarget in > 'ExecutorAllocationManager' will be 0. > condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks > will go negative as job1's tasks end. > condition3: Job2 has only one task. > result: > In the method 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', > we will calculate #maxNeeded in 'maxNumExecutorsNeeded()'. Obviously, > #numRunningOrPendingTasks will be negative and the #maxNeeded will be 0 or > negative. So the 'ExecutorAllocationManager' will not request containers from > YARN. The app will hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18981) The last job hung when speculation is on
roncenzhao created SPARK-18981: -- Summary: The last job hung when speculation is on Key: SPARK-18981 URL: https://issues.apache.org/jira/browse/SPARK-18981 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.2 Environment: spark2.0.2 hadoop2.5.0 Reporter: roncenzhao Priority: Critical related settings: spark.speculation true spark.dynamicAllocation.minExecutors 0 spark.executor.cores 4 When I run the following app, the bug is triggered. ``` sc.runJob(job1) sleep(100s) sc.runJob(job2) // job2 will hang and never be scheduled ``` The triggering conditions are as follows: condition1: During the sleep, the executors are released and the number of executors drops to zero a few seconds later, so #numExecutorsTarget in 'ExecutorAllocationManager' becomes 0. condition2: In 'ExecutorAllocationListener.onTaskEnd()', numRunningTasks becomes negative while job1's tasks are finishing. condition3: job2 has only one task. result: In 'ExecutorAllocationManager.updateAndSyncNumExecutorsTarget()', #maxNeeded is computed by 'maxNumExecutorsNeeded()'. Since #numRunningOrPendingTasks is negative, #maxNeeded is 0 or negative, so 'ExecutorAllocationManager' never requests containers from YARN and the app hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
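The arithmetic in the report above can be sketched in plain Scala. This is a hypothetical, simplified model of 'maxNumExecutorsNeeded()' (the real logic lives in org.apache.spark.ExecutorAllocationManager and differs in detail); it only illustrates how a negative numRunningTasks can drive the executor target to zero:

```scala
// Simplified, illustrative model of the executor-target computation.
// tasksPerExecutor = spark.executor.cores / spark.task.cpus = 4 here.
object AllocationModel {
  val tasksPerExecutor = 4

  // Ceiling division of the outstanding task count by tasks per executor.
  def maxNumExecutorsNeeded(numRunningTasks: Int, numPendingTasks: Int): Int = {
    val numRunningOrPendingTasks = numRunningTasks + numPendingTasks
    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }
}
```

With the counters in a healthy state (0 running, 1 pending) the model asks for one executor; with numRunningTasks drifted to -3, as in the report, job2's single pending task yields a target of 0 and no container is ever requested.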
[jira] [Commented] (SPARK-18199) Support appending to Parquet files
[ https://issues.apache.org/jira/browse/SPARK-18199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770282#comment-15770282 ] Soubhik Chakraborty commented on SPARK-18199: - Can't we use the PARQUET-382 feature that was added in 1.9.0? SPARK-13127 is already in progress, and assuming it doesn't hit any major hurdle, we should be able to use it right away. The only downside is that it doesn't allow appending two different schemas, which is understandable. > Support appending to Parquet files > -- > > Key: SPARK-18199 > URL: https://issues.apache.org/jira/browse/SPARK-18199 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jeremy Smith > > Currently, appending to a Parquet directory involves simply creating new > parquet files in the directory. With many small appends (for example, in a > streaming job with a short batch duration) this leads to an unbounded number > of small Parquet files accumulating. These must be cleaned up with some > frequency by removing them all and rewriting a new file containing all the > rows. > It would be far better if Spark supported appending to the Parquet files > themselves. HDFS supports this, as does Parquet: > * The Parquet footer can be read in order to obtain necessary metadata. > * The new rows can then be appended to the Parquet file as a row group. > * A new footer can then be appended containing the metadata and referencing > the new row groups as well as the previously existing row groups. > This would result in a small amount of bloat in the file as new row groups > are added (since duplicate metadata would accumulate) but it's hugely > preferable to accumulating small files, which is bad for HDFS health and also > eventually leads to Spark being unable to read the Parquet directory at all. > Periodic rewriting of the file could still be performed in order to remove > the duplicate metadata.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
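The append-with-new-footer layout described in the issue can be modeled abstractly. This is a toy sketch, not parquet-mr code: a "file" is a list of blocks, a footer records the offsets of every row group, and appending never rewrites old bytes, it only adds a row group plus a fresh footer that supersedes the previous one (the superseded footer remains as the "bloat" the reporter mentions):

```scala
// Toy model of a footer-terminated columnar file.
case class Footer(rowGroupOffsets: List[Int])
case class ToyFile(blocks: List[String], footer: Footer)

// Append a row group and a new footer referencing all groups so far.
// The old footer block stays in place as dead bytes (metadata bloat).
def append(file: ToyFile, rowGroup: String): ToyFile = {
  val offset = file.blocks.length                   // where the new group lands
  val blocks = file.blocks :+ rowGroup :+ "footer"  // old blocks are untouched
  ToyFile(blocks, Footer(file.footer.rowGroupOffsets :+ offset))
}
```

A periodic rewrite would compact the file back to row groups plus a single footer, which is the deduplication step the issue proposes.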
[jira] [Commented] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows
[ https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15770030#comment-15770030 ] Hyukjin Kwon commented on SPARK-18878: -- Thank you for guiding me. Let me try to follow it. > Fix/investigate the more identified test failures in Java/Scala on Windows > -- > > Key: SPARK-18878 > URL: https://issues.apache.org/jira/browse/SPARK-18878 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Hyukjin Kwon > > It seems many tests are failing on Windows. Some failures are related only to > the tests themselves, whereas others are related to the functionality itself, > causing actual failures for some APIs on Windows. > The tests were hanging due to some issues in SPARK-17591 and SPARK-18785, and > now we can apparently proceed much further (it seems we might reach the end). > The tests proceeded via AppVeyor - > https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18896) Suppress ScalaCheck warning -- Unknown ScalaCheck args provided when executing tests using sbt
[ https://issues.apache.org/jira/browse/SPARK-18896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769991#comment-15769991 ] PJ Fanning commented on SPARK-18896: I noticed from the pull request that you are looking at possibly upgrading scalatest too. Getting to scalatest 3.0.1 would be useful for later scala 2.12 support. Scalatest 2.x is not cross compiled for Scala 2.12. > Suppress ScalaCheck warning -- Unknown ScalaCheck args provided when > executing tests using sbt > -- > > Key: SPARK-18896 > URL: https://issues.apache.org/jira/browse/SPARK-18896 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Trivial > > While executing tests for {{DAGScheduler}} I've noticed the following warning: > {code} > > core/testOnly org.apache.spark.scheduler.DAGSchedulerSuite > ... > [info] Warning: Unknown ScalaCheck args provided: -oDF > {code} > The reason is due to a bug in ScalaCheck as reported in > https://github.com/rickynils/scalacheck/issues/212 and fixed in > https://github.com/rickynils/scalacheck/commit/df435a5 that is available in > ScalaCheck 1.13.4. > Spark uses [ScalaCheck > 1.12.5|https://github.com/apache/spark/blob/master/pom.xml#L717] which is > behind the latest 1.12.6 [released on Nov > 1|https://github.com/rickynils/scalacheck/releases] (not to mention 1.13.4). > Let's get rid of ScalaCheck's warning (and perhaps upgrade ScalaCheck along > the way too!). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18977) Heavy udf is not stopped by cancelJobGroup
[ https://issues.apache.org/jira/browse/SPARK-18977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitaly Gerasimov updated SPARK-18977: - Summary: Heavy udf is not stopped by cancelJobGroup (was: Heavy udf in not stopped by cancelJobGroup) > Heavy udf is not stopped by cancelJobGroup > -- > > Key: SPARK-18977 > URL: https://issues.apache.org/jira/browse/SPARK-18977 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2 >Reporter: Vitaly Gerasimov > > Let's say we have a heavy UDF that takes a long time to run. When I run a job > in a job group that executes this UDF and then call cancelJobGroup(), the job > continues processing. > {code} > # ./spark-shell > > import scala.concurrent.Future > > import scala.concurrent.ExecutionContext.Implicits.global > > sc.setJobGroup("test-group", "udf-test") > > sqlContext.udf.register("sleep", (times: Int) => { (1 to > > times).toList.foreach{ _ => print("sleep..."); Thread.sleep(1) }; 1L }) > > Future { Thread.sleep(5); sc.cancelJobGroup("test-group") } > > sqlContext.sql("SELECT sleep(10)").collect() > {code} > It returns: > {code} > sleep...sleep...sleep...sleep...sleep...org.apache.spark.SparkException: Job > 0 cancelled part of cancelled job group test-group > > sleep...sleep...sleep...sleep...sleep...16/12/22 14:36:44 WARN > > TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): TaskKilled > > (killed intentionally) > {code} > This seems unexpected to me, but if I'm missing something and it works as > expected, feel free to close the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18980) implement Aggregator with TypedImperativeAggregate
[ https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18980: Assignee: Wenchen Fan (was: Apache Spark) > implement Aggregator with TypedImperativeAggregate > -- > > Key: SPARK-18980 > URL: https://issues.apache.org/jira/browse/SPARK-18980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18980) implement Aggregator with TypedImperativeAggregate
[ https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18980: Assignee: Apache Spark (was: Wenchen Fan) > implement Aggregator with TypedImperativeAggregate > -- > > Key: SPARK-18980 > URL: https://issues.apache.org/jira/browse/SPARK-18980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18980) implement Aggregator with TypedImperativeAggregate
[ https://issues.apache.org/jira/browse/SPARK-18980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769916#comment-15769916 ] Apache Spark commented on SPARK-18980: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16383 > implement Aggregator with TypedImperativeAggregate > -- > > Key: SPARK-18980 > URL: https://issues.apache.org/jira/browse/SPARK-18980 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18980) implement Aggregator with TypedImperativeAggregate
Wenchen Fan created SPARK-18980: --- Summary: implement Aggregator with TypedImperativeAggregate Key: SPARK-18980 URL: https://issues.apache.org/jira/browse/SPARK-18980 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18979) ShutdownHookManager:Exception while deleting Spark temp dir
[ https://issues.apache.org/jira/browse/SPARK-18979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769827#comment-15769827 ] zuotingbing commented on SPARK-18979: - My SPARK_LOCAL_DIRS value is set like this: SPARK_LOCAL_DIRS=/data2/zdh/spark/tmp,/data3/zdh/spark/tmp,/data4/zdh/spark/tmp,/data5/zdh/spark/tmp,/data6/zdh/spark/tmp,/data7/zdh/spark/tmp From the worker log we find that deletion failed for 3 of the 6 dirs: 【 Search "Deleting directory" (6 hits in 1 files) C:\Users\10159544\Desktop\logs\logs\spark-mr-worker-IDC-ZTECache-Logserver20-B12-U13.log (6 hits) Line 909563: 2016-12-15 20:09:52,629 INFO ShutdownHookManager: Deleting directory /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d Line 910117: 2016-12-15 20:12:59,930 INFO ShutdownHookManager: Deleting directory /data6/zdh/spark/tmp/spark-688964cc-010d-44cd-b398-5f54fcee0b34 Line 910118: 2016-12-15 20:15:21,932 INFO ShutdownHookManager: Deleting directory /data3/zdh/spark/tmp/spark-f15e06d0-fedf-4618-8ff5-8c7b10a55cbe Line 913560: 2016-12-15 20:17:49,990 INFO ShutdownHookManager: Deleting directory /data5/zdh/spark/tmp/spark-16647cd9-57b0-4893-a876-5a39ebc4078f Line 915393: 2016-12-15 20:20:11,767 INFO ShutdownHookManager: Deleting directory /data7/zdh/spark/tmp/spark-18e24536-8739-4d1a-8381-c762b540addf Line 915394: 2016-12-15 20:20:11,767 INFO ShutdownHookManager: Deleting directory /data4/zdh/spark/tmp/spark-e7f52941-3e2c-48d7-993c-1df891e24577 Search "Exception while deleting" (3 hits in 1 files) C:\Users\10159544\Desktop\logs\logs\spark-mr-worker-IDC-ZTECache-Logserver20-B12-U13.log (3 hits) Line 910098: 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d Line 913541: 2016-12-15 20:17:49,990 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: /data3/zdh/spark/tmp/spark-f15e06d0-fedf-4618-8ff5-8c7b10a55cbe Line 1044408: 2016-12-15 20:22:40,501 ERROR
ShutdownHookManager: Exception while deleting Spark temp dir: /data4/zdh/spark/tmp/spark-e7f52941-3e2c-48d7-993c-1df891e24577 】 > ShutdownHookManager:Exception while deleting Spark temp dir > - > > Key: SPARK-18979 > URL: https://issues.apache.org/jira/browse/SPARK-18979 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: zuotingbing > > when i stop the worker process, the SPARK_LOCAL_DIRS should be delete > recursively but failed. Exception info: > 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting > Spark temp dir: > /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d > java.io.IOException: Failed to delete: > /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Resolved] (SPARK-18979) ShutdownHookManager:Exception while deleting Spark temp dir
[ https://issues.apache.org/jira/browse/SPARK-18979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-18979. --- Resolution: Duplicate > ShutdownHookManager:Exception while deleting Spark temp dir > - > > Key: SPARK-18979 > URL: https://issues.apache.org/jira/browse/SPARK-18979 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: zuotingbing > > when i stop the worker process, the SPARK_LOCAL_DIRS should be delete > recursively but failed. Exception info: > 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting > Spark temp dir: > /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d > java.io.IOException: Failed to delete: > /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d > at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) > at > org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) > at > 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) > at scala.util.Try$.apply(Try.scala:161) > at > org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) > at > org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) > at > org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18979) ShutdownHookManager:Exception while deleting Spark temp dir
zuotingbing created SPARK-18979: --- Summary: ShutdownHookManager:Exception while deleting Spark temp dir Key: SPARK-18979 URL: https://issues.apache.org/jira/browse/SPARK-18979 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2 Reporter: zuotingbing when i stop the worker process, the SPARK_LOCAL_DIRS should be delete recursively but failed. Exception info: 2016-12-15 20:12:59,930 ERROR ShutdownHookManager: Exception while deleting Spark temp dir: /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d java.io.IOException: Failed to delete: /data2/zdh/spark/tmp/spark-67cf188b-5978-42d9-8ce6-e181f0ba4d0d at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63) at org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234) at 
scala.util.Try$.apply(Try.scala:161) at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234) at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
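One mitigation direction for the failure above (a hedged sketch, not Spark's actual fix): retry the recursive delete a few times, since a file written concurrently by a still-running executor can make a single pass over the directory fail. The helper names below are illustrative, not Spark's:

```scala
import java.io.File

// Best-effort recursive delete; listFiles() can return null if the
// directory vanished or an I/O error occurred, so guard with Option.
def deleteRecursively(f: File): Unit = {
  Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
  f.delete()
}

// Retry the whole pass: a concurrent writer may recreate entries
// mid-delete, so "success" means the root no longer exists afterwards.
def deleteWithRetries(f: File, attempts: Int = 3): Boolean =
  (1 to attempts).exists { _ =>
    try deleteRecursively(f) catch { case _: Exception => () }
    !f.exists()
  }
```

A retry loop only helps transient races; if another process holds the directory open for the whole shutdown window, the delete will still fail after the last attempt.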
[jira] [Commented] (SPARK-18977) Heavy udf is not stopped by cancelJobGroup
[ https://issues.apache.org/jira/browse/SPARK-18977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769741#comment-15769741 ] Vitaly Gerasimov commented on SPARK-18977: -- Yeah, you are right. But how are jobs stopped in Spark? By thread interruption? > Heavy udf is not stopped by cancelJobGroup > -- > > Key: SPARK-18977 > URL: https://issues.apache.org/jira/browse/SPARK-18977 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2 >Reporter: Vitaly Gerasimov > > Let's say we have a heavy UDF that takes a long time to run. When I run a job > in a job group that executes this UDF and then call cancelJobGroup(), the job > continues processing. > {code} > # ./spark-shell > > import scala.concurrent.Future > > import scala.concurrent.ExecutionContext.Implicits.global > > sc.setJobGroup("test-group", "udf-test") > > sqlContext.udf.register("sleep", (times: Int) => { (1 to > > times).toList.foreach{ _ => print("sleep..."); Thread.sleep(1) }; 1L }) > > Future { Thread.sleep(5); sc.cancelJobGroup("test-group") } > > sqlContext.sql("SELECT sleep(10)").collect() > {code} > It returns: > {code} > sleep...sleep...sleep...sleep...sleep...org.apache.spark.SparkException: Job > 0 cancelled part of cancelled job group test-group > > sleep...sleep...sleep...sleep...sleep...16/12/22 14:36:44 WARN > > TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): TaskKilled > > (killed intentionally) > {code} > This seems unexpected to me, but if I'm missing something and it works as > expected, feel free to close the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
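On the question above: when a job group is created with interruptOnCancel enabled (sc.setJobGroup(groupId, description, true)), cancellation interrupts the task threads, but only user code that observes the interrupt flag actually stops early, which is why a tight UDF loop keeps running. A hedged sketch of interruption-aware user code (illustrative names, not a Spark API):

```scala
// Do up to `iterations` units of work, but bail out as soon as the
// current (task) thread has been interrupted, e.g. by a cancelled
// job group with interruptOnCancel enabled.
def interruptibleWork(iterations: Int): Int = {
  var done = 0
  while (done < iterations && !Thread.currentThread().isInterrupted) {
    done += 1 // one unit of "heavy" work
  }
  done
}
```

Inside a UDF, Thread.sleep() also throws InterruptedException when the task thread is interrupted, so propagating that exception instead of swallowing it is another way to stop promptly.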
[jira] [Comment Edited] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718 ] zakaria hili edited comment on SPARK-18608 at 12/22/16 10:42 AM: - [~srowen], Now I understand what you mean; your purpose is to optimize memory. However, I think all we need is an extra check: if the dataframe is cached, we call the mllib algo directly; if not, we have to cache the internal rdd. But the problem with this solution is: if we do not cache the internal rdd, we will get a lot of warnings from the mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true. So maybe we will need to add new public methods in mllib that take handlePersistence as a parameter: if it is true, there is no need to recheck the input RDD was (Author: zahili): [~srowen], I understand now what you mean, your purpose is to optimize the memory,however, I think that all we need is to add an extra check, if the dataframe is cached, we call the mllib algo directly, if not, we have to cache the internal rdd But the problem of this solution is: if we do not cache the internal rdd, we will get a lot of warnings from mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true.
So maybe we will need to create a new public class in mllib methods which can take the handlePersistence as parameter : if it has true there no need to recheck the input RDD > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath > > Some algorithms in Spark ML (e.g. {{LogisticRegression}}, > {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence > internally. They check whether the input dataset is cached, and if not they > cache it for performance. > However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. > This will actually always be true, since even if the dataset itself is > cached, the RDD returned by {{dataset.rdd}} will not be cached. > Hence if the input dataset is cached, the data will end up being cached > twice, which is wasteful. > To see this: > {code} > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val df = spark.range(10).toDF("num") > df: org.apache.spark.sql.DataFrame = [num: bigint] > scala> df.storageLevel == StorageLevel.NONE > res0: Boolean = true > scala> df.persist > res1: df.type = [num: bigint] > scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK > res2: Boolean = true > scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK > res3: Boolean = false > scala> df.rdd.getStorageLevel == StorageLevel.NONE > res4: Boolean = true > {code} > Before SPARK-16063, there was no way to check the storage level of the input > {{DataSet}}, but now we can, so the checks should be migrated to use > {{dataset.storageLevel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718 ] zakaria hili edited comment on SPARK-18608 at 12/22/16 10:41 AM: - [~srowen], I understand now what you mean, your purpose is to optimize the memory,however, I think that all we need is to add an extra check, if the dataframe is cached, we call the mllib algo directly, if not, we have to cache the internal rdd But the problem of this solution is: if we do not cache the internal rdd, we will get a lot of warnings from mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true. So maybe we will need to create a new public class in mllib methods which can take the handlePersistence as parameter : if it has true there no need to recheck the input RDD was (Author: zahili): I understand now what you mean, your purpose is to optimize the memory,however, I think that all we need is to add an extra check, if the dataframe is cached, we call the mllib algo directly, if not, we have to cache the internal rdd But the problem of this solution: if we do not cache the internal rdd, we will get a lot of warnings from mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true. So maybe we will need to create a new public class in mllib methods which can take the handlePersistence as parameter : if it has true there no need to recheck the input RDD > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath > > Some algorithms in Spark ML (e.g. 
{{LogisticRegression}}, > {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence > internally. They check whether the input dataset is cached, and if not they > cache it for performance. > However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. > This will actually always be true, since even if the dataset itself is > cached, the RDD returned by {{dataset.rdd}} will not be cached. > Hence if the input dataset is cached, the data will end up being cached > twice, which is wasteful. > To see this: > {code} > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val df = spark.range(10).toDF("num") > df: org.apache.spark.sql.DataFrame = [num: bigint] > scala> df.storageLevel == StorageLevel.NONE > res0: Boolean = true > scala> df.persist > res1: df.type = [num: bigint] > scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK > res2: Boolean = true > scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK > res3: Boolean = false > scala> df.rdd.getStorageLevel == StorageLevel.NONE > res4: Boolean = true > {code} > Before SPARK-16063, there was no way to check the storage level of the input > {{DataSet}}, but now we can, so the checks should be migrated to use > {{dataset.storageLevel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718 ] zakaria hili edited comment on SPARK-18608 at 12/22/16 10:39 AM: - I understand now what you mean, your purpose is to optimize the memory,however, I think that all we need is to add an extra check, if the dataframe is cached, we call the mllib algo directly, if not, we have to cache the internal rdd But the problem of this solution: if we do not cache the internal rdd, we will get a lot of warnings from mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true. So maybe we will need to create a new public class in mllib methods which can take the handlePersistence as parameter : if it has true there no need to recheck the input RDD was (Author: zahili): I understand now what you mean, your purpose is to optimize the memory,however, I think that all we need is to add an extra check, if the dataframe is cached, we call the mllib algo directly, if not, we have to cache the internal rdd But the problem of this solution: if we do not cache the internal rdd, we will get a lot of warnings from mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true. So maybe we will need to create a new public class in mllib methods which can take the handlePersistence as parameter : if it has true there no need to recheck the input RDD > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath > > Some algorithms in Spark ML (e.g. 
{{LogisticRegression}}, > {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence > internally. They check whether the input dataset is cached, and if not they > cache it for performance. > However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. > This will actually always be true, since even if the dataset itself is > cached, the RDD returned by {{dataset.rdd}} will not be cached. > Hence if the input dataset is cached, the data will end up being cached > twice, which is wasteful. > To see this: > {code} > scala> import org.apache.spark.storage.StorageLevel > import org.apache.spark.storage.StorageLevel > scala> val df = spark.range(10).toDF("num") > df: org.apache.spark.sql.DataFrame = [num: bigint] > scala> df.storageLevel == StorageLevel.NONE > res0: Boolean = true > scala> df.persist > res1: df.type = [num: bigint] > scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK > res2: Boolean = true > scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK > res3: Boolean = false > scala> df.rdd.getStorageLevel == StorageLevel.NONE > res4: Boolean = true > {code} > Before SPARK-16063, there was no way to check the storage level of the input > {{DataSet}}, but now we can, so the checks should be migrated to use > {{dataset.storageLevel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
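A rough sketch of what the migrated check could look like, per the discussion above. This is not the actual Spark patch; `handlePersistence` is the parameter name suggested in the comment, and `fitWithCaching` is a hypothetical wrapper used only for illustration:

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Sketch: decide whether an ML algorithm needs to persist its internal RDD.
// Uses dataset.storageLevel (available since SPARK-16063) instead of
// dataset.rdd.getStorageLevel, which is always NONE for the fresh RDD view
// returned by dataset.rdd, even when the Dataset itself is cached.
def fitWithCaching[T](dataset: Dataset[T]): Unit = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val instances = dataset.rdd
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    // ... run the underlying mllib algorithm on `instances`,
    // passing handlePersistence down so it skips its own cache check ...
  } finally {
    if (handlePersistence) instances.unpersist()
  }
}
```

With this shape, an already-cached input Dataset is never double-cached, while an uncached one is persisted for the duration of the fit and released afterwards.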
[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769718#comment-15769718 ] zakaria hili commented on SPARK-18608: -- I understand now what you mean; your goal is to optimize memory usage. However, I think all we need is an extra check: if the DataFrame is already cached, we will not cache the RDD and will call the mllib algorithm directly; if not, we cache the internal RDD. The problem with this solution: if we do not cache the internal RDD, we will get a lot of warnings from the mllib package (RDD is not cached) https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L216 because df.rdd.getStorageLevel == StorageLevel.NONE is true. So maybe we will need to expose a new public entry point in the mllib methods that takes handlePersistence as a parameter: if it is true, there is no need to re-check the input RDD's storage level. > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath
[jira] [Commented] (SPARK-18964) HiveContext does not support Time Interval Literals
[ https://issues.apache.org/jira/browse/SPARK-18964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769691#comment-15769691 ] Suhas Nalapure commented on SPARK-18964: My understanding is that both features, namely Window Functions (https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html) and Time Interval Literals, are Spark SQL features and hence ideally both should be supported by the SQLContext. That is the case in Spark 2.0, but in all earlier versions Window Functions are not supported by the SQLContext. > HiveContext does not support Time Interval Literals > --- > > Key: SPARK-18964 > URL: https://issues.apache.org/jira/browse/SPARK-18964 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.6.0 >Reporter: Suhas Nalapure > > HiveContext does not recognize the Time Interval Literals mentioned here > https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights.html. > E.g. the following Spark SQL query runs just fine when a SQLContext is used but > fails when HiveContext is used > > select *, case when `Order_Date` + INTERVAL 7 DAY > `Ship_Date` then "On > Time" else "Late" end as Shipment_On_Time from sales; > Logs: > -- > org.apache.spark.sql.AnalysisException: cannot recognize input near > 'INTERVAL' '7' 'DAY' in expression specification; line 2 pos 30 > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:318) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242) > 
at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254) > at > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at > scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890) > at > scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34) > at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295) > at > org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66) > at > org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279) > at > org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268) > at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:65) > at > org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211) > at > 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211) > at > org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114) > at > org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135) > at >
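For the failing query above, one possible workaround on Spark 1.5/1.6 is to avoid the SQL `INTERVAL` literal entirely and express the 7-day offset through the DataFrame API, which the HiveContext parser never sees. This is a sketch, not an official recommendation; it assumes a `sqlContext` (SQLContext or HiveContext) in scope, a registered `sales` table, and the column names from the query above:

```scala
import org.apache.spark.sql.functions.{col, date_add, lit, when}

// Workaround sketch: replace `Order_Date` + INTERVAL 7 DAY with date_add(),
// so the condition is built as Column expressions rather than parsed HiveQL.
val sales = sqlContext.table("sales")
val result = sales.withColumn(
  "Shipment_On_Time",
  when(date_add(col("Order_Date"), 7) > col("Ship_Date"), lit("On Time"))
    .otherwise(lit("Late")))
```

`date_add` has been part of `org.apache.spark.sql.functions` since 1.5, so this should behave the same under both contexts.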
[jira] [Commented] (SPARK-18878) Fix/investigate the more identified test failures in Java/Scala on Windows
[ https://issues.apache.org/jira/browse/SPARK-18878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769663#comment-15769663 ] Sean Owen commented on SPARK-18878: --- I understand why you need to fix some of these things in batches, but I see https://issues.apache.org/jira/browse/SPARK-17591 and https://issues.apache.org/jira/browse/SPARK-18785 with virtually the same title. This has two child JIRAs that are also pretty much identical. This is no longer very meaningful. Let's close down all but one JIRA for resource-closing problems, and not resolve it until you're pretty sure they're done. Let's also close general "umbrella" JIRAs like this. If there is a new type of Windows problem, there can be a new JIRA targeted at that new type of fix, and again we can make many PRs for one JIRA if needed to fix all of it in batches. > Fix/investigate the more identified test failures in Java/Scala on Windows > -- > > Key: SPARK-18878 > URL: https://issues.apache.org/jira/browse/SPARK-18878 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Hyukjin Kwon > > It seems many tests are failing on Windows. Some are related only to the > tests themselves, whereas others are related to the functionality itself, which > causes actual failures for some APIs on Windows. > The tests were hanging due to some issues in SPARK-17591 and SPARK-18785, and > now apparently we can proceed much further (it seems we might > reach the end). > The tests proceeded via AppVeyor - > https://ci.appveyor.com/project/spark-test/spark/build/259-spark-test-windows
[jira] [Updated] (SPARK-18976) in standalone mode, an executor expired by HeartbeatReceiver still takes up cores but no tasks are assigned to it
[ https://issues.apache.org/jira/browse/SPARK-18976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-18976: -- Target Version/s: (was: 1.6.1) Fix Version/s: (was: 1.6.1) [~liujianhui] please read http://spark.apache.org/contributing.html before opening a JIRA. Don't set fix/target version. It doesn't even make sense to target a version released a year ago. Can you test this vs 2.1 or master? Can you provide any reproduction or more detail? I'm not sure this is enough info. > in standalone mode, an executor expired by HeartbeatReceiver still takes up > cores but no tasks are assigned to it > -- > > Key: SPARK-18976 > URL: https://issues.apache.org/jira/browse/SPARK-18976 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.1 > Environment: jdk1.8.0_77 Red Hat 4.4.7-11 >Reporter: liujianhui > Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png > > > h2. scene > when an executor is expired by HeartbeatReceiver in the driver, the driver will mark that > executor as not alive and the task scheduler will not assign tasks to it, but that > executor's status will always be running and it keeps taking up cores. Executor 18 > was expired with no task running (its task time is far less than that of the normal > executor 142), but on the app page the executor is still shown as running > !screenshot-1.png! > !screenshot-2.png! > !screenshot-3.png! 
> h2. process > # executor expired by HeartbeatReceiver because the last heartbeat exceeded the > executor timeout > # executor will be removed in CoarseGrainedSchedulerBackend.killExecutors, so > that executor will be marked as dead; it will not be scheduled an offer from now on > because it is in executorsPendingToRemove > # status of that executor is running because the CoarseGrainedExecutorBackend > process still exists and it re-registers its block manager with the driver every > 10s; log as > {code} > 16/12/22 17:04:26 INFO Executor: Told to re-register on heartbeat > 16/12/22 17:04:26 INFO BlockManager: BlockManager re-registering with master > 16/12/22 17:04:26 INFO BlockManagerMaster: Trying to register BlockManager > 16/12/22 17:04:26 INFO BlockManagerMaster: Registered BlockManager > 16/12/22 17:04:26 INFO BlockManager: Reporting 0 blocks to the master. > 16/12/22 17:04:36 INFO Executor: Told to re-register on heartbeat > 16/12/22 17:04:36 INFO BlockManager: BlockManager re-registering with master > 16/12/22 17:04:36 INFO BlockManagerMaster: Trying to register BlockManager > 16/12/22 17:04:36 INFO BlockManagerMaster: Registered BlockManager > 16/12/22 17:04:36 INFO BlockManager: Reporting 0 blocks to the master. > 16/12/22 17:04:46 INFO Executor: Told to re-register on heartbeat > 16/12/22 17:04:46 INFO BlockManager: BlockManager re-registering with master > 16/12/22 17:04:46 INFO BlockManagerMaster: Trying to register BlockManager > 16/12/22 17:04:46 INFO BlockManagerMaster: Registered BlockManager > 16/12/22 17:04:46 INFO BlockManager: Reporting 0 blocks to the master. > 16/12/22 17:04:56 INFO Executor: Told to re-register on heartbeat > 16/12/22 17:04:56 INFO BlockManager: BlockManager re-registering with master > 16/12/22 17:04:56 INFO BlockManagerMaster: Trying to register BlockManager > 16/12/22 17:04:56 INFO BlockManagerMaster: Registered BlockManager > 16/12/22 17:04:56 INFO BlockManager: Reporting 0 blocks to the master. > {code} > h2. resolve > when the number of re-register attempts exceeds some threshold (e.g. 10), the executor should > exit with code zero
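The proposed resolution could be sketched as a small counter around the heartbeat response, roughly like the following. All names here (`ReregisterGuard`, `onHeartbeatResponse`, the threshold of 10) are hypothetical illustrations of the reporter's suggestion, not actual Spark internals:

```scala
// Sketch of the proposed fix: count consecutive "Told to re-register on
// heartbeat" responses from the driver and signal the executor to exit once
// a threshold is exceeded, instead of re-registering the BlockManager
// forever while the driver already considers this executor dead.
object ReregisterGuard {
  private val maxReregisterAttempts = 10 // threshold suggested in the issue
  private var reregisterAttempts = 0

  /** Call once per heartbeat; returns true if the executor should exit. */
  def onHeartbeatResponse(toldToReregister: Boolean): Boolean = {
    if (toldToReregister) {
      reregisterAttempts += 1
      reregisterAttempts > maxReregisterAttempts
    } else {
      reregisterAttempts = 0 // a healthy heartbeat resets the counter
      false
    }
  }
}
```

When `onHeartbeatResponse` returns true, the executor process would call `System.exit(0)` so the cluster manager can reclaim its cores.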
[jira] [Commented] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data
[ https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15769648#comment-15769648 ] Sean Owen commented on SPARK-18608: --- I don't think this has to do with PySpark. The situation is different: the input is cached, but the intermediate RDD created internally is not, and so is cached again. > Spark ML algorithms that check RDD cache level for internal caching > double-cache data > - > > Key: SPARK-18608 > URL: https://issues.apache.org/jira/browse/SPARK-18608 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Nick Pentreath 