[jira] [Commented] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
[ https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171294#comment-17171294 ] Hyukjin Kwon commented on SPARK-32534: -- [~kvanlieshout] can you show the full steps to reproduce and the error messages you got? > Cannot load a Pipeline Model on a stopped Spark Context > --- > > Key: SPARK-32534 > URL: https://issues.apache.org/jira/browse/SPARK-32534 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.4.6 >Reporter: Kevin Van Lieshout >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > I am running Spark in a Kubernetes cluster that is running Spark NLP, using > the PySpark ML PipelineModel class to load the model and then transform the > Spark dataframe. We run this within a docker container that starts up a > spark context, mounts volumes, spins up executors, etc., then does its > transformations, udfs, etc., and then shuts down the spark context. The first > time I load the model when my service has just been started, everything is > fine. If I run my application a second time without restarting my service, > even though the context is entirely stopped from the previous run and a new > one is started up, the PipelineModel has some attribute in one of its base > classes that thinks the context it is running on is closed, so I get a > "cannot call a function on a stopped spark context" error when I try to load the > model in my service again. I have to shut down my service each time if I want > consecutive runs through my spark pipeline, which is not ideal, so I was > wondering if this is a common issue amongst fellow pyspark users that use > PipelineModel, whether there is a common workaround to reset all spark > contexts, or whether the pipeline model caches a spark context of some sort. > Any help is very useful. > > > cls.pipeline = PipelineModel.read().load(NLP_MODEL) > > is how I load the model. 
And our spark context is very similar to a typical > kubernetes/spark setup. Nothing special there -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
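The reporter's symptom (a second load failing even after a full context restart) is consistent with a context reference cached at class or module level that outlives `stop()`. A pure-Python sketch of that failure pattern; `FakeContext` and `ModelReader` are illustrative stand-ins, not Spark's actual `SparkContext` or `PipelineModel` classes:

```python
class FakeContext:
    """Stand-in for a SparkContext with a stop() lifecycle."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class ModelReader:
    """Stand-in for a model loader that caches its context at class level."""
    _cached_ctx = None  # survives across "service runs" - the suspect pattern

    @classmethod
    def load(cls, ctx):
        # Buggy pattern: the first context ever seen is cached and reused,
        # even after it has been stopped and a fresh one exists.
        if cls._cached_ctx is None:
            cls._cached_ctx = ctx
        if cls._cached_ctx.stopped:
            raise RuntimeError("Cannot call methods on a stopped SparkContext")
        return "model"


ctx1 = FakeContext()
ModelReader.load(ctx1)   # first run succeeds
ctx1.stop()              # service tears down the context

ctx2 = FakeContext()     # second run starts a brand-new context
# ModelReader.load(ctx2) would raise: the stale cached context is consulted,
# mirroring the error reported even though ctx2 itself is healthy.
```

If this is indeed the mechanism, the fix on the library side is to resolve the active context at load time rather than caching it.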
[jira] [Commented] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
[ https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171293#comment-17171293 ] Hyukjin Kwon commented on SPARK-32534: -- Please avoid setting Blocker+ for Priority, which is usually reserved for committers. > Cannot load a Pipeline Model on a stopped Spark Context > --- > > Key: SPARK-32534 > URL: https://issues.apache.org/jira/browse/SPARK-32534 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.4.6 >Reporter: Kevin Van Lieshout >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > I am running Spark in a Kubernetes cluster that is running Spark NLP, using > the PySpark ML PipelineModel class to load the model and then transform the > Spark dataframe. We run this within a docker container that starts up a > spark context, mounts volumes, spins up executors, etc., then does its > transformations, udfs, etc., and then shuts down the spark context. The first > time I load the model when my service has just been started, everything is > fine. If I run my application a second time without restarting my service, > even though the context is entirely stopped from the previous run and a new > one is started up, the PipelineModel has some attribute in one of its base > classes that thinks the context it is running on is closed, so I get a > "cannot call a function on a stopped spark context" error when I try to load the > model in my service again. I have to shut down my service each time if I want > consecutive runs through my spark pipeline, which is not ideal, so I was > wondering if this is a common issue amongst fellow pyspark users that use > PipelineModel, whether there is a common workaround to reset all spark > contexts, or whether the pipeline model caches a spark context of some sort. > Any help is very useful. > > > cls.pipeline = PipelineModel.read().load(NLP_MODEL) > > is how I load the model. 
And our spark context is very similar to a typical > kubernetes/spark setup. Nothing special there
[jira] [Updated] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
[ https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32534: - Priority: Minor (was: Blocker) > Cannot load a Pipeline Model on a stopped Spark Context > --- > > Key: SPARK-32534 > URL: https://issues.apache.org/jira/browse/SPARK-32534 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.4.6 >Reporter: Kevin Van Lieshout >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > I am running Spark in a Kubernetes cluster that is running Spark NLP, using > the PySpark ML PipelineModel class to load the model and then transform the > Spark dataframe. We run this within a docker container that starts up a > spark context, mounts volumes, spins up executors, etc., then does its > transformations, udfs, etc., and then shuts down the spark context. The first > time I load the model when my service has just been started, everything is > fine. If I run my application a second time without restarting my service, > even though the context is entirely stopped from the previous run and a new > one is started up, the PipelineModel has some attribute in one of its base > classes that thinks the context it is running on is closed, so I get a > "cannot call a function on a stopped spark context" error when I try to load the > model in my service again. I have to shut down my service each time if I want > consecutive runs through my spark pipeline, which is not ideal, so I was > wondering if this is a common issue amongst fellow pyspark users that use > PipelineModel, whether there is a common workaround to reset all spark > contexts, or whether the pipeline model caches a spark context of some sort. > Any help is very useful. > > > cls.pipeline = PipelineModel.read().load(NLP_MODEL) > > is how I load the model. And our spark context is very similar to a typical > kubernetes/spark setup. 
Nothing special there
[jira] [Updated] (SPARK-32535) Query with broadcast hints fail when query has a WITH clause
[ https://issues.apache.org/jira/browse/SPARK-32535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32535: - Component/s: (was: Spark Core) SQL > Query with broadcast hints fail when query has a WITH clause > > > Key: SPARK-32535 > URL: https://issues.apache.org/jira/browse/SPARK-32535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Arvind Krishnan >Priority: Major > > If a query has a WITH clause and a query hint (like `BROADCAST`), the query > fails. > In the code sample below, executing `sql2` fails, but `sql1` passes. > {code:java} > import spark.implicits._ > val df = List( > ("1", "B", "C"), > ("A", "2", "C"), > ("A", "B", "3") > ).toDF("COL_A", "COL_B", "COL_C") > df.createOrReplaceTempView("table1") > val df1 = List( > ("A", "2", "3"), > ("1", "B", "3"), > ("1", "2", "C") > ).toDF("COL_A", "COL_B", "COL_C") > df1.createOrReplaceTempView("table2") > val sql1 = "select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join > table2 b on a.COL_A = b.COL_A" > val sql2 = "with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner > join table2 b on a.COL_A = b.COL_A) select X.COL_A from X" > val df2 = spark.sql(sql2) > println(s"Row Count ${df2.count()}") > println("Rows... ") > df2.show(false) > {code} > > I tried executing this sample program with Spark 2.4.0, and both SQL > statements work.
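The structural difference between the passing and failing statements is where the hint sits: in `sql2` the `/*+ BROADCAST(a) */` hint is scoped inside the CTE body rather than the top-level query block. A small, hypothetical string-level check (not a Spark API) that distinguishes the two shapes from the report:

```python
import re

# The two SQL strings from the bug report, verbatim.
SQL1 = ("select /*+ BROADCAST(a) */ a.COL_A from table1 a "
        "inner join table2 b on a.COL_A = b.COL_A")
SQL2 = ("with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner "
        "join table2 b on a.COL_A = b.COL_A) select X.COL_A from X")


def hint_inside_cte(sql):
    """Return True when a /*+ ... */ hint appears inside a WITH-clause body."""
    m = re.search(r"with\s+\w+\s+as\s*\((.*)\)", sql,
                  re.IGNORECASE | re.DOTALL)
    return bool(m and re.search(r"/\*\+.*?\*/", m.group(1)))


hint_inside_cte(SQL1)  # False: hint is in the top-level query block (passes)
hint_inside_cte(SQL2)  # True: hint is scoped to the CTE body (the failing case)
```

This is only a diagnostic aid for spotting the failing shape; it says nothing about how Spark 3.0 actually resolves hints inside CTEs.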
[jira] [Updated] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Description: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. 
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) ... 8 more Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception when loading 1 in table id_name2 with loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; {code} it seems that Spark doesn't check whether the partition's HDFS location exists before deleting it, while Hive can successfully execute the same SQL. 
was: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.
[jira] [Updated] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition (was: deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition) > deleted not existing hdfs locations when use spark sql to execute "insert > overwrite" statement to dynamic partition > --- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
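The guard the reporter suggests is missing can be sketched in plain Python: check that a partition location exists before attempting cleanup, which is the behavior Hive exhibits for the same statement. `FakeFs` and `cleanup_old_partition` below are illustrative stand-ins, not Hadoop FileSystem or Hive Metastore APIs:

```python
def cleanup_old_partition(fs, path):
    """Delete `path` only if it exists; return whether anything was removed.

    Skipping a missing location avoids the FileNotFoundException seen in
    the log above when the old partition directory never existed.
    """
    if not fs.exists(path):
        return False
    fs.delete(path, recursive=True)
    return True


class FakeFs:
    """Tiny in-memory stand-in for a distributed filesystem client."""
    def __init__(self, paths):
        self.paths = set(paths)

    def exists(self, path):
        return path in self.paths

    def delete(self, path, recursive=False):
        self.paths.discard(path)


fs = FakeFs({"/warehouse/tmp.db/id_name2/dt=2000"})
cleanup_old_partition(fs, "/warehouse/tmp.db/id_name2/dt=2001")  # False: missing, skipped
cleanup_old_partition(fs, "/warehouse/tmp.db/id_name2/dt=2000")  # True: existed, removed
```

Whether Spark should add such a check, or tolerate the exception the way Hive does, is exactly the question the issue raises.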
[jira] [Updated] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition (was: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition) > deleted not existing hdfs locations when use spark sql to execute "insert > overwrite" statement with dynamic partition > - > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
[jira] [Updated] (SPARK-32536) deleted not existin hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition (was: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement) > deleted not existin hdfs locations when use spark sql to execute "insert > overwrite" statement with dynamic partition > > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
[jira] [Updated] (SPARK-32536) deleted not existin hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement (was: deleted not existing partition hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement) > deleted not existin hdfs locations when use spark sql to execute "insert > overwrite" dynamic partition statement > --- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
[jira] [Updated] (SPARK-32536) deleted not existing partition hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existing partition hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement (was: spark sql insert overwrite dynamic partition deleted not existing partition hdfs locations) > deleted not existing partition hdfs locations when use spark sql to execute > "insert overwrite" dynamic partition statement > -- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
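The guard the reporter suggests amounts to checking for the partition directory before attempting cleanup. A hypothetical sketch of that idea, with the local filesystem standing in for HDFS (the real fix would live in Hive's deleteOldPathForReplace/cleanUpOneDirectoryForReplace path; this is not the actual patch):

```python
import os
import shutil

def clean_up_dir_for_replace(path: str) -> None:
    """Delete an old partition directory, tolerating its absence.

    Illustrative only: the local filesystem stands in for HDFS. The
    existence check avoids the FileNotFoundException in the report when
    the partition directory was never created.
    """
    if not os.path.isdir(path):
        # Nothing to clean up -- the INSERT OVERWRITE targets a partition
        # whose directory does not exist yet.
        return
    shutil.rmtree(path)

# Demo: cleaning a missing directory is a no-op instead of an error.
os.makedirs("id_name2/dt=2001", exist_ok=True)
clean_up_dir_for_replace("id_name2/dt=2001")   # removes the directory
clean_up_dir_for_replace("id_name2/dt=2001")   # second call: safe no-op
```

The second call mirrors the failing case in the log: the directory is already gone, but cleanup returns quietly rather than raising.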
[jira] [Updated] (SPARK-32536) spark sql insert overwrite dynamic partition deleted not existing partition hdfs locations
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: spark sql insert overwrite dynamic partition deleted not existing partition hdfs locations (was: spark sql insert overwrite dynamic partition error) > spark sql insert overwrite dynamic partition deleted not existing partition > hdfs locations > -- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32536) spark sql insert overwrite dynamic partition error
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Description: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. 
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) ... 8 more Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception when loading 1 in table id_name2 with loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; {code} it seems that Spark doesn't check whether the partition's HDFS location exists before deleting it. 
was: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at o
[jira] [Updated] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31851: - Description: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an example. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ - https://pandas.pydata.org/docs/ was: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an exmaple. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ - https://pandas.pydata.org/docs/ > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > example. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. 
> Reference: > - https://koalas.readthedocs.io/en/latest/ > - https://pandas.pydata.org/docs/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31851: - Description: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an exmaple. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ - https://pandas.pydata.org/docs/ was: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an exmaple. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ > - https://pandas.pydata.org/docs/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32536) spark sql insert overwrite dynamic partition error
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Attachment: SPARK-32536.full.log > spark sql insert overwrite dynamic partition error > -- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32536) spark sql insert overwrite dynamic partition error
yx91490 created SPARK-32536: --- Summary: spark sql insert overwrite dynamic partition error Key: SPARK-32536 URL: https://issues.apache.org/jira/browse/SPARK-32536 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Environment: HDP version 2.3.2.3.1.4.0-315 Reporter: yx91490 when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. 
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) ... 8 more Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception when loading 1 in table id_name2 with loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171284#comment-17171284 ] Hyukjin Kwon commented on SPARK-32187: -- [~fhoering], I made one example at SPARK-32507 to refer to. Please also see https://issues.apache.org/jira/browse/SPARK-31851?focusedCommentId=17171275&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17171275 Would you be able to start writing up a page about PEX? If you're not used to shipping Python packages with zipped files or .py files, you can write it about PEX only for now. I can file a separate JIRA for that if that's better for you. > User Guide - Shipping Python Package > > > Key: SPARK-32187 > URL: https://issues.apache.org/jira/browse/SPARK-32187 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > - Zipped file > - Python files > - PEX \(?\) (see also SPARK-25433) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
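For the "Zipped file" bullet in that ticket, the mechanics can be sketched without Spark itself: build a .zip whose entries keep the package's top-level name, then ship it with `spark-submit --py-files` or `SparkContext.addPyFile`. The names below (`mypkg`, `deps.zip`) are illustrative only, not from the ticket:

```python
import os
import zipfile

def zip_package(pkg_dir: str, out_zip: str) -> str:
    """Zip a package directory so `import <pkg>` works on executors.

    Entries are stored relative to the package's parent directory, so the
    archive contains e.g. mypkg/__init__.py rather than bare file names.
    """
    root = os.path.dirname(os.path.abspath(pkg_dir))
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(pkg_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, root))
    return out_zip

# Build a throwaway package and bundle it.
os.makedirs("mypkg", exist_ok=True)
with open(os.path.join("mypkg", "__init__.py"), "w") as f:
    f.write("VERSION = '0.1'\n")
zip_package("mypkg", "deps.zip")
# Then, hypothetically:
#   spark-submit --py-files deps.zip app.py
# or at runtime: spark.sparkContext.addPyFile("deps.zip")
```

The same archive layout is what the eventual user-guide page would need to describe for the zipped-file and plain .py cases; PEX would replace the zip step with a pex build.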
[jira] [Comment Edited] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon edited comment on SPARK-31851 at 8/5/20, 6:24 AM: --- SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: {{cd python/docs && make clean html}} and open {{python/docs/build/html/index.html}}. - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 As an example, if you're adding a page under "User Guide", you might have to do: 1. go to the source: {code} cd python/docs/source/user_guide {code} 2. Write up a page you want. Let's suppose you wrote `shipping_pagkages.rst`. 3. Put it in {{python/docs/source/user_guide/index.rst}}: {code} ... == User Guide == .. toctree:: :maxdepth: 2 shipping_pagkages ... {code} was (Author: hyukjin.kwon): SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 As an example, if you're adding a page under "User Guide", you might have to do: 1. go to the source: {code} cd python/docs/source/user_guide {code} 2. Write up a page you want. Let's suppose you wrote `shipping_pagkages.rst`. 3. Put it in {{python/docs/source/user_guide/index.rst}}: {code} ... == User Guide == .. 
toctree:: :maxdepth: 2 shipping_pagkages ... {code} > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon edited comment on SPARK-31851 at 8/5/20, 6:23 AM: --- SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 As an example, if you're adding a page under "User Guide", you might have to do: 1. go to the source: {code} cd python/docs/source/user_guide {code} 2. Write up a page you want. Let's suppose you wrote `shipping_pagkages.rst`. 3. Put it in {{python/docs/source/user_guide/index.rst}}: {code} ... == User Guide == .. toctree:: :maxdepth: 2 shipping_pagkages ... {code} was (Author: hyukjin.kwon): SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! 
Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon edited comment on SPARK-31851 at 8/5/20, 6:14 AM: --- SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 was (Author: hyukjin.kwon): SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon commented on SPARK-31851: -- SPARK-32507 was merged. People should be able to refer to this as an example. If you guys are interested in taking some of the sub-tasks here, please feel free to go ahead! > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > example. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32528: Description: In tests, we usually call `plan.analyze` to get the analyzed plan and test analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan is resolved, which may surprise the test writers. > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > > In tests, we usually call `plan.analyze` to get the analyzed plan and test > analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan > is resolved, which may surprise the test writers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32507. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29320 [https://github.com/apache/spark/pull/29320] > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > We should make a main package to overview PySpark properly. See the demo > example: > https://hyukjin-spark.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32535) Query with broadcast hints fail when query has a WITH clause
Arvind Krishnan created SPARK-32535: --- Summary: Query with broadcast hints fail when query has a WITH clause Key: SPARK-32535 URL: https://issues.apache.org/jira/browse/SPARK-32535 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Arvind Krishnan If a query has a WITH clause and a query hint (like `BROADCAST`), the query fails. In the code sample below, executing `sql2` fails, but `sql1` passes. {code:java} import spark.implicits._ val df = List( ("1", "B", "C"), ("A", "2", "C"), ("A", "B", "3") ).toDF("COL_A", "COL_B", "COL_C") df.createOrReplaceTempView("table1") val df1 = List( ("A", "2", "3"), ("1", "B", "3"), ("1", "2", "C") ).toDF("COL_A", "COL_B", "COL_C") df1.createOrReplaceTempView("table2") val sql1 = "select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join table2 b on a.COL_A = b.COL_A" val sql2 = "with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join table2 b on a.COL_A = b.COL_A) select X.COL_A from X" val df2 = spark.sql(sql2) println(s"Row Count ${df2.count()}") println("Rows... ") df2.show(false) {code} I tried executing this sample program with Spark 2.4.0, and both SQL statements work.
[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1
[ https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171197#comment-17171197 ] L. C. Hsieh commented on SPARK-32502: - I did some testings in the PRs. Few changes are required to pass the failed Hive tests: # Shading Guava at hive-exec packaging and a few code changes to hive-common and hive-exec regarding Guava usage # Don't use core classifier for hive dependencies in Spark But this just upgrades Guava version used in Spark. Hive dependencies still use older Guava with the reported CVE. > Please fix CVE related to Guava 14.0.1 > -- > > Key: SPARK-32502 > URL: https://issues.apache.org/jira/browse/SPARK-32502 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Rodney Aaron Stainback >Priority: Major > > Please fix the following CVE related to Guava 14.0.1 > |cve|severity|cvss| > |CVE-2018-10237|medium|5.9| > > Our security team is trying to block us from using spark because of this issue > > One thing that's very weird is I see from this [pom > file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]] > you reference guava but it's not clear what version. > > But if I look on > [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]] > the guava reference is not showing up > > Is this reference somehow being shaded into the network common jar? It's not > clear to me. > > Also, I've noticed code like [this > file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]] > which is a copy-paste of some guava source code. > > The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute > Edition is very thorough and will find CVEs in copy-pasted code and shaded > jars. 
> > Please fix this CVE so we can use spark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
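"Shading" in the comment above means relocating Guava's packages at build time so the packaged jar does not expose the vulnerable classes under their original `com.google.common` names. The fragment below is only an illustrative maven-shade-plugin configuration, not the actual Spark or Hive build change discussed in the PRs; the relocated package name is an assumption.

```xml
<!-- Illustrative maven-shade-plugin fragment (relocated package name is an
     assumption): rewrites Guava class references into a private namespace
     so the original com.google.common classes are not exposed. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.spark_project.guava</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Note that, as the comment points out, relocation only hides the classes in the shaded artifact; dependencies that still pull in an old Guava keep the reported CVE.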
[jira] [Commented] (SPARK-31419) Document Table-valued Function and Inline Table
[ https://issues.apache.org/jira/browse/SPARK-31419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171187#comment-17171187 ] Apache Spark commented on SPARK-31419: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29355 > Document Table-valued Function and Inline Table > --- > > Key: SPARK-31419 > URL: https://issues.apache.org/jira/browse/SPARK-31419 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Document Table-valued Function and Inline Table -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171184#comment-17171184 ] Apache Spark commented on SPARK-32533: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29354 > Improve Avro read/write performance on nested structs and array of structs > -- > > Key: SPARK-32533 > URL: https://issues.apache.org/jira/browse/SPARK-32533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for Avro file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 75 -> 46 > Array of Struct: 47 -> 17 > Write > Nested Structs: 147 -> 36 > Array of Struct: 139 -> 34 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32533: Assignee: Apache Spark > Improve Avro read/write performance on nested structs and array of structs > -- > > Key: SPARK-32533 > URL: https://issues.apache.org/jira/browse/SPARK-32533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Have some improvements for Avro file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 75 -> 46 > Array of Struct: 47 -> 17 > Write > Nested Structs: 147 -> 36 > Array of Struct: 139 -> 34 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32533: Assignee: (was: Apache Spark) > Improve Avro read/write performance on nested structs and array of structs > -- > > Key: SPARK-32533 > URL: https://issues.apache.org/jira/browse/SPARK-32533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for Avro file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 75 -> 46 > Array of Struct: 47 -> 17 > Write > Nested Structs: 147 -> 36 > Array of Struct: 139 -> 34 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
Kevin Van Lieshout created SPARK-32534: -- Summary: Cannot load a Pipeline Model on a stopped Spark Context Key: SPARK-32534 URL: https://issues.apache.org/jira/browse/SPARK-32534 Project: Spark Issue Type: Bug Components: Deploy, Kubernetes Affects Versions: 2.4.6 Reporter: Kevin Van Lieshout I am running Spark in a Kubernetes cluster that runs Spark NLP, using the PySpark ML PipelineModel class to load the model and then transform the Spark DataFrame. We run this within a docker container that starts up a Spark context, mounts volumes, spins up executors, etc., performs its transformations, UDFs, etc., and then shuts down the Spark context. The first time I load the model, when my service has just been started, everything is fine. If I run my application a second time without restarting my service, even though the previous context has been entirely stopped and a new one started up, the PipelineModel has some attribute in one of its base classes that still thinks the context it is running on is closed, so I get a "cannot call a function on a stopped spark context" error when I try to load the model in my service again. I have to shut down my service each time if I want consecutive runs through my Spark pipeline, which is not ideal. I was wondering whether this is a common issue among fellow PySpark users of PipelineModel, whether there is a common workaround for resetting all Spark contexts, or whether the pipeline model caches a Spark context of some sort. Any help is very useful. cls.pipeline = PipelineModel.read().load(NLP_MODEL) is how I load the model. And our Spark context setup is very similar to a typical Kubernetes/Spark setup. Nothing special there.
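The symptom described above matches a common pattern: a handle to the first context is cached at class or module level and is never refreshed, so later loads keep calling into the stopped context even though a fresh one exists. The following is a Spark-free sketch of that pattern and of the usual workaround (clearing the cached reference before reloading); `Context` and `ModelLoader` are hypothetical names, not PySpark classes.

```python
# Spark-free sketch of the failure pattern: a loader that caches a context
# handle at class level keeps using the stopped one even after a fresh
# context is started.  Names (Context, ModelLoader) are hypothetical.

class Context:
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True
    def call(self):
        if self.stopped:
            raise RuntimeError("Cannot call methods on a stopped context")
        return "ok"

class ModelLoader:
    _ctx = None                     # cached once, never refreshed: the bug
    @classmethod
    def load(cls, ctx):
        if cls._ctx is None:
            cls._ctx = ctx
        return cls._ctx.call()

ctx1 = Context()
ModelLoader.load(ctx1)              # first run: fine
ctx1.stop()

ctx2 = Context()
try:
    ModelLoader.load(ctx2)          # second run still hits the stale ctx1
except RuntimeError:
    pass

ModelLoader._ctx = None             # workaround: drop the cached handle
assert ModelLoader.load(ctx2) == "ok"
```

If PipelineModel (or a library underneath it, such as Spark NLP) caches state like this, resetting that cached reference between runs — rather than restarting the whole service — may be enough; which attribute to clear depends on the actual implementation and is not confirmed here.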
[jira] [Assigned] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32532: Assignee: (was: Apache Spark) > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171118#comment-17171118 ] Apache Spark commented on SPARK-32532: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29353 > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32532: Assignee: Apache Spark > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32531: Description: We had found that Spark performance was slow as compared to PIG on some schemas in our pipelines. On investigation, it was found that Spark performance was slow for nested structs and array'd structs and these cases were not being profiled by the current benchmarks. I have some improvements for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the performance in these cases and will be putting up the PRs soon. (was: Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon.) > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > We had found that Spark performance was slow as compared to PIG on some > schemas in our pipelines. On investigation, it was found that Spark > performance was slow for nested structs and array'd structs and these cases > were not being profiled by the current benchmarks. I have some improvements > for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the > performance in these cases and will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171108#comment-17171108 ] Apache Spark commented on SPARK-32531: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29352 > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32531: Assignee: Apache Spark > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32531: Assignee: (was: Apache Spark) > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32531: Summary: Add benchmarks for nested structs and arrays for different file formats (was: Add benchmarks for nested structs and arrays for different data types) > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32532: Description: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. was: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
Muhammad Samir Khan created SPARK-32533: --- Summary: Improve Avro read/write performance on nested structs and array of structs Key: SPARK-32533 URL: https://issues.apache.org/jira/browse/SPARK-32533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Have some improvements for Avro file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds): Read: Nested Structs: 75 -> 46 Array of Struct: 47 -> 17 Write Nested Structs: 147 -> 36 Array of Struct: 139 -> 34 Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32532: Description: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. was: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was > able to improve performance as follows (measurements in seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
Muhammad Samir Khan created SPARK-32532: --- Summary: Improve ORC read/write performance on nested structs and array of structs Key: SPARK-32532 URL: https://issues.apache.org/jira/browse/SPARK-32532 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32531) Add benchmarks for nested structs and arrays for different data types
Muhammad Samir Khan created SPARK-32531: --- Summary: Add benchmarks for nested structs and arrays for different data types Key: SPARK-32531 URL: https://issues.apache.org/jira/browse/SPARK-32531 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pasha Finkeshteyn updated SPARK-32530: -- Issue Type: Improvement (was: Bug) > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. 
Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide a CLI for Kotlin for Apache Spark; that will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API, and support can potentially be dropped at any > time. > We also believe that the existing API is rather low-maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both the Spark and Kotlin communities. As the Kotlin data community > continues to grow, we see the Kotlin API for Apache Spark as an important part of > the evolving Kotlin ecosystem, and intend to fully support it. > h2. How long will it take? > A working implementation is already available, and if the community > has any proposals for improving this implementation, they > can be implemented quickly, in weeks if not days. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32530) SPIP: Kotlin support for Apache Spark
Pasha Finkeshteyn created SPARK-32530: - Summary: SPIP: Kotlin support for Apache Spark Key: SPARK-32530 URL: https://issues.apache.org/jira/browse/SPARK-32530 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Pasha Finkeshteyn h2. Background and motivation Kotlin is a cross-platform, statically typed, general-purpose JVM language. In the last year more than 5 million developers have used Kotlin in mobile, backend, frontend and scientific development. The number of Kotlin developers grows rapidly every year. * [According to redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: "Kotlin, the second fastest growing language we’ve seen outside of Swift, made a big splash a year ago at this time when it vaulted eight full spots up the list." * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], Kotlin is the second most popular language on the JVM * [According to StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share increased by 7.8% in 2020. We notice the increasing usage of Kotlin in data analysis ([6% of users in 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 2019), and we expect these numbers to continue to grow. We, authors of this SPIP, strongly believe that making Kotlin API officially available to developers can bring new users to Apache Spark and help some of the existing users. h2. Goals The goal of this project is to bring first-class support for Kotlin language into the Apache Spark project. We’re going to achieve this by adding one more module to the current Apache Spark distribution. h2. Non-goals There is no goal to replace any existing language support or to change any existing Apache Spark API. At this time, there is no goal to support non-core APIs of Apache Spark like Spark ML and Spark structured streaming. 
This may change in the future based on community feedback. There is no goal to provide a CLI for Kotlin for Apache Spark; that will be a separate SPIP. There is no goal to provide support for Apache Spark < 3.0.0. h2. Current implementation A working prototype is available at [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside JetBrains and by early adopters. h2. What are the risks? There is always a risk that this product won’t get enough popularity and will bring more costs than benefits. It can be mitigated by the fact that we don't need to change any existing API, and support can potentially be dropped at any time. We also believe that the existing API is rather low-maintenance. It does not bring anything more complex than already exists in the Spark codebase. Furthermore, the implementation is compact - less than 2000 lines of code. We are committed to maintaining, improving and evolving the API based on feedback from both the Spark and Kotlin communities. As the Kotlin data community continues to grow, we see the Kotlin API for Apache Spark as an important part of the evolving Kotlin ecosystem, and intend to fully support it. h2. How long will it take? A working implementation is already available, and if the community has any proposals for improving this implementation, they can be implemented quickly, in weeks if not days.
[jira] [Updated] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-32003: - Fix Version/s: 2.4.7 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. 
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171083#comment-17171083 ] Imran Rashid commented on SPARK-32003: -- Fixed in 2.4.7 by https://github.com/apache/spark/pull/29182 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. 
So they used up 4 > stage attempts and the stage failed and thus the job.
[jira] [Commented] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171080#comment-17171080 ] Sean R. Owen commented on SPARK-32527: -- Yep, SO is a better place for questions. That said I think you just want spark.ui.enabled=false > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Minor > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
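For reference, a minimal sketch of where Sean's suggested property would go; note that spark.ui.enabled governs the per-application driver UI (port 4040 by default), while 8080 and 8081 belong to the standalone master and worker daemons:

```properties
# conf/spark-defaults.conf: disable the per-application web UI
spark.ui.enabled false
```

For the daemon UIs themselves, the standalone-mode docs expose port overrides (SPARK_MASTER_WEBUI_PORT and SPARK_WORKER_WEBUI_PORT in conf/spark-env.sh); to my knowledge there is no supported switch that disables the master/worker UIs outright in 2.3.3, so firewalling those ports may be the practical option.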
[jira] [Updated] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32527: - Priority: Minor (was: Critical) > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Minor > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Resolved] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra resolved SPARK-32527. -- Resolution: Not A Problem > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Commented] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171069#comment-17171069 ] Rohit Mishra commented on SPARK-32527: -- [~shinudin], * Please use Stack Overflow for any questions. Kindly check this document to understand the best practices - http://spark.apache.org/community.html * Please don't mark the priority as Critical; those levels are mostly used by committers. * This issue will be marked resolved for now. Thanks. > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Comment Edited] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171069#comment-17171069 ] Rohit Mishra edited comment on SPARK-32527 at 8/4/20, 7:16 PM: --- [~shinudin], * Please use stack overflow for any question or use User mail list. Kindly check this document to understand the best practices - [http://spark.apache.org/community.html] * Please don't mark priority as critical, these are mostly used by committers. * This issue will be marked resolved for now. Thanks. was (Author: rohitmishr1484): [~shinudin], * Please use stack overflow for any question. Kindly check this document to understand the best practices - http://spark.apache.org/community.html * Please don't mark priority as critical, these are mostly used by committers. * This issue will be marked resolved for now. Thanks. > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171062#comment-17171062 ] Rohit Mishra commented on SPARK-32528: -- [~cloud_fan], Can you please populate the description field? > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor >
[jira] [Commented] (SPARK-32427) Omit USING in CREATE TABLE via JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171020#comment-17171020 ] Huaxin Gao commented on SPARK-32427: [~maxgekk] I made tableProvider optional so USING can be omitted, but now it seems the CREATE TABLE syntax is ambiguous. For example, the following CREATE TABLE is supposed to create a Hive table, but after my change it creates a data source table. {code:java} s"""CREATE TABLE t1 ( | c1 INT COMMENT 'bla', | c2 STRING |) |TBLPROPERTIES ( | 'prop1' = 'value1', | 'prop2' = 'value2' |) """.stripMargin {code} > Omit USING in CREATE TABLE via JDBC Table Catalog > - > > Key: SPARK-32427 > URL: https://issues.apache.org/jira/browse/SPARK-32427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Support creating tables in JDBC Table Catalog without USING, for instance: > {code:sql} > CREATE TABLE h2.test.new_table(i INT, j STRING) > {code}
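The ambiguity can be made concrete with a pair of statements (a sketch of the issue, not output from any particular Spark version):

```sql
-- Explicit provider: unambiguously a data source table.
CREATE TABLE t1 (c1 INT, c2 STRING) USING parquet;

-- No USING clause: historically parsed as a Hive table in builds with Hive
-- support, but once the provider is optional it could equally be read as a
-- data source table using spark.sql.sources.default -- the ambiguity
-- Huaxin describes above.
CREATE TABLE t2 (c1 INT, c2 STRING)
TBLPROPERTIES ('prop1' = 'value1');
```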
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170945#comment-17170945 ] Imran Rashid commented on SPARK-32003: -- Fixed in 3.0.1 by https://github.com/apache/spark/pull/29193 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. 
So they used up 4 > stage attempts and the stage failed and thus the job.
[jira] [Updated] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-32003: - Fix Version/s: 3.0.1 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. 
[jira] [Assigned] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
[ https://issues.apache.org/jira/browse/SPARK-32529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32529: Assignee: (was: Apache Spark) > Spark 3.0 History Server May Never Finish One Round Log Dir Scan > > > Key: SPARK-32529 > URL: https://issues.apache.org/jira/browse/SPARK-32529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Priority: Major > > If there are a large number (>100k) of applications in the log dir, listing the log > dir can take a few seconds. After getting the path list, some applications > might have finished already, and the filename will have changed from > "foo.inprogress" to "foo". > This leads to a problem when adding an entry to the listing: querying file > status such as `fileSizeForLastIndex` will throw a `FileNotFoundException` > if the application has finished. The exception aborts the > current scan loop; in a busy cluster, this can prevent the history server > from listing and loading any application logs. 
> > > {code:java} > 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.io.FileNotFoundException: File does not exist: > hdfs://xx/logs/spark/application_11.lz4.inprogress > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
[ https://issues.apache.org/jira/browse/SPARK-32529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32529: Assignee: Apache Spark > Spark 3.0 History Server May Never Finish One Round Log Dir Scan > > > Key: SPARK-32529 > URL: https://issues.apache.org/jira/browse/SPARK-32529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Assignee: Apache Spark >Priority: Major > > If there are a large number (>100k) of applications in the log dir, listing the log > dir can take a few seconds. After getting the path list, some applications > might have finished already, and the filename will have changed from > "foo.inprogress" to "foo". > This leads to a problem when adding an entry to the listing: querying file > status such as `fileSizeForLastIndex` will throw a `FileNotFoundException` > if the application has finished. The exception aborts the > current scan loop; in a busy cluster, this can prevent the history server > from listing and loading any application logs. 
> > > {code:java} > 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.io.FileNotFoundException: File does not exist: > hdfs://xx/logs/spark/application_11.lz4.inprogress > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
[ https://issues.apache.org/jira/browse/SPARK-32529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170926#comment-17170926 ] Apache Spark commented on SPARK-32529: -- User 'yanxiaole' has created a pull request for this issue: https://github.com/apache/spark/pull/29350 > Spark 3.0 History Server May Never Finish One Round Log Dir Scan > > > Key: SPARK-32529 > URL: https://issues.apache.org/jira/browse/SPARK-32529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Priority: Major > > If there are a large number (>100k) of applications in the log dir, listing the log > dir can take a few seconds. After getting the path list, some applications > might have finished already, and the filename will have changed from > "foo.inprogress" to "foo". > This leads to a problem when adding an entry to the listing: querying file > status such as `fileSizeForLastIndex` will throw a `FileNotFoundException` > if the application has finished. The exception aborts the > current scan loop; in a busy cluster, this can prevent the history server > from listing and loading any application logs. 
> > > {code:java} > 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.io.FileNotFoundException: File does not exist: > hdfs://xx/logs/spark/application_11.lz4.inprogress > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32525) The layout of monitoring.html is broken
[ https://issues.apache.org/jira/browse/SPARK-32525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-32525. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/29345 > The layout of monitoring.html is broken > --- > > Key: SPARK-32525 > URL: https://issues.apache.org/jira/browse/SPARK-32525 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > The layout of monitoring.html is broken because there are 2 tags not > closed in monitoring.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
Yan Xiaole created SPARK-32529: -- Summary: Spark 3.0 History Server May Never Finish One Round Log Dir Scan Key: SPARK-32529 URL: https://issues.apache.org/jira/browse/SPARK-32529 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Yan Xiaole If there are a large number (>100k) of applications in the log dir, listing the dir can take a few seconds. By the time the path list is obtained, some applications may already have finished, and their filenames will have changed from "foo.inprogress" to "foo". This causes a problem when adding an entry to the listing: querying file status via `fileSizeForLastIndex` throws a `FileNotFoundException` if the application has finished. The exception aborts the current scan loop; in a busy cluster this can leave the history server unable to list and load any application logs. {code:java} 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11.lz4.inprogress at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) at 
scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:347) at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
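The race described above (the directory listing goes stale while files are renamed from "foo.inprogress" to "foo", and a single FileNotFoundException aborts the whole scan round) comes down to guarding each per-entry stat call. A minimal sketch of that defensive pattern; `LogEntry` and the `fileSize` function are illustrative stand-ins, not FsHistoryProvider's actual API:

```scala
import java.io.FileNotFoundException

// Illustrative stand-in for the entries the history server iterates over.
final case class LogEntry(path: String)

// Between listing the log dir and statting each file, "foo.inprogress" can be
// renamed to "foo". A per-entry guard turns the resulting FileNotFoundException
// into a skipped entry instead of aborting the whole scan round.
def scanSafely(entries: Seq[LogEntry], fileSize: String => Long): Seq[(LogEntry, Long)] =
  entries.flatMap { e =>
    try Some(e -> fileSize(e.path))
    catch { case _: FileNotFoundException => None } // renamed/deleted: skip it
  }
```

With this shape, a file that vanishes between the listing and the stat is simply skipped, and the scan round still completes for every other application log.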
[jira] [Updated] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-32526: - Summary: Let sql/catalyst module tests pass for Scala 2.13 (was: Let sql/catalyst module compile for Scala 2.13) > Let sql/catalyst module tests pass for Scala 2.13 > - > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: catalyst-failed-cases > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq > on these to ensure they still work on 2.12. 
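The `.toSeq` fix the ticket points at can be shown in isolation. On Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, so a `mutable.ArrayBuffer` no longer satisfies a `Seq` return type and produces exactly the "type mismatch" errors quoted above. Appending `.toSeq` compiles on both 2.12 and 2.13 (the method name below is illustrative, not the Analyzer's actual code):

```scala
import scala.collection.mutable.ArrayBuffer

// On 2.13, Seq means immutable.Seq, so returning the buffer directly fails to
// compile with "found: ArrayBuffer[...] required: Seq[...]". On 2.12 the same
// code compiled because mutable collections conformed to scala.Seq.
def collectAttrs(pairs: Iterator[(Int, Int)]): Seq[(Int, Int)] = {
  val buf = ArrayBuffer.empty[(Int, Int)]
  pairs.foreach(buf += _)
  buf.toSeq // without .toSeq: OK on 2.12, compile error on 2.13
}
```

On 2.12 `.toSeq` is essentially free (the buffer already is a `Seq`), while on 2.13 it copies into an immutable sequence, which is why the ticket notes the change must still work on 2.12.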
[jira] [Resolved] (SPARK-32499) Use {} for structs and maps in show()
[ https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32499. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29308 [https://github.com/apache/spark/pull/29308] > Use {} for structs and maps in show() > - > > Key: SPARK-32499 > URL: https://issues.apache.org/jira/browse/SPARK-32499 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Currently, show() wraps arrays, maps and structs by []. Maps and structs > should be wrapped by {}: > - To be consistent with ToHiveResult > - To distinguish maps/structs from arrays -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
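The rendering change can be sketched with a toy formatter: arrays keep `[]`, while maps move to `{}` (structs, which this sketch does not model, get the same `{}` treatment in Spark). Illustrative only, not `Dataset.show()`'s actual code path:

```scala
// Toy value formatter mirroring the described behavior: sequences stay in [],
// maps are wrapped in {} so they can be told apart from arrays, matching
// ToHiveResult-style output. Single-entry maps keep the output deterministic.
def fmt(v: Any): String = v match {
  case s: Seq[_]    => s.map(fmt).mkString("[", ", ", "]")
  case m: Map[_, _] => m.map { case (k, x) => s"${fmt(k)} -> ${fmt(x)}" }.mkString("{", ", ", "}")
  case other        => String.valueOf(other)
}
```

For example, an array of one map would render as `[{k -> v}]` rather than the ambiguous `[[k -> v]]`.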
[jira] [Assigned] (SPARK-32499) Use {} for structs and maps in show()
[ https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32499: --- Assignee: Maxim Gekk > Use {} for structs and maps in show() > - > > Key: SPARK-32499 > URL: https://issues.apache.org/jira/browse/SPARK-32499 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, show() wraps arrays, maps and structs by []. Maps and structs > should be wrapped by {}: > - To be consistent with ToHiveResult > - To distinguish maps/structs from arrays -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170843#comment-17170843 ] Apache Spark commented on SPARK-32528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29349 > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32528: Assignee: Apache Spark (was: Wenchen Fan) > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32528: Assignee: Wenchen Fan (was: Apache Spark) > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32528: Summary: The analyze method should make sure the plan is analyzed (was: The analyze method make sure the plan is analyzed) > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32528) The analyze method make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32528: --- Assignee: Wenchen Fan > The analyze method make sure the plan is analyzed > - > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32528) The analyze method make sure the plan is analyzed
Wenchen Fan created SPARK-32528: --- Summary: The analyze method make sure the plan is analyzed Key: SPARK-32528 URL: https://issues.apache.org/jira/browse/SPARK-32528 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170817#comment-17170817 ] Thomas Graves commented on SPARK-32037: --- allowlist and blocklist have been used by others. Seems we may only need blocklist. I'm hesitant with healthtracker as it could be used for other health checks but it does sound better. [https://github.com/golang/go/commit/608cdcaede1e7133dc994b5e8894272c2dce744b] [https://9to5google.com/2020/06/12/google-android-chrome-blacklist-blocklist-more-inclusive/] [https://bugzilla.mozilla.org/show_bug.cgi?id=1571734] DenyList: https://issues.apache.org/jira/browse/GEODE-5685 [https://github.com/nodejs/node/pull/33813] > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether this term > has racist origins, the cultural connotations are inescapable in today's > world. > I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation. 
[jira] [Resolved] (SPARK-23431) Expose the new executor memory metrics at the stage level
[ https://issues.apache.org/jira/browse/SPARK-23431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-23431. Resolution: Fixed The issue is resolved in https://github.com/apache/spark/pull/29020 > Expose the new executor memory metrics at the stage level > - > > Key: SPARK-23431 > URL: https://issues.apache.org/jira/browse/SPARK-23431 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Assignee: Terry Kim >Priority: Major > > Collect and show the new executor memory metrics for each stage, to provide > more information on how memory is used per stage. > Modify the AppStatusListener to track the peak values for JVM used memory, > execution memory, storage memory, and unified memory for each executor for > each stage. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
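The tracking described (peak values of each memory metric per executor per stage) reduces to a running max keyed by (stageId, executorId). A minimal sketch with invented names, not AppStatusListener's actual fields:

```scala
import scala.collection.mutable

// Running max of one metric per (stageId, executorId). AppStatusListener does
// this for several metrics at once (JVM used, execution, storage, unified
// memory); a single Long stands in for that metric set here.
final class PeakTracker {
  private val peaks = mutable.Map.empty[(Int, String), Long]

  // Record a sample; only the largest value seen per key is retained.
  def update(stageId: Int, executorId: String, value: Long): Unit = {
    val key = (stageId, executorId)
    peaks(key) = math.max(peaks.getOrElse(key, Long.MinValue), value)
  }

  def peak(stageId: Int, executorId: String): Option[Long] =
    peaks.get((stageId, executorId))
}
```

Keying on the stage as well as the executor is what distinguishes this from the existing executor-level metrics: the same executor can show different peaks in different stages.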
[jira] [Assigned] (SPARK-23431) Expose the new executor memory metrics at the stage level
[ https://issues.apache.org/jira/browse/SPARK-23431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-23431: -- Assignee: Terry Kim > Expose the new executor memory metrics at the stage level > - > > Key: SPARK-23431 > URL: https://issues.apache.org/jira/browse/SPARK-23431 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Assignee: Terry Kim >Priority: Major > > Collect and show the new executor memory metrics for each stage, to provide > more information on how memory is used per stage. > Modify the AppStatusListener to track the peak values for JVM used memory, > execution memory, storage memory, and unified memory for each executor for > each stage. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-32526: - Parent: SPARK-25075 Issue Type: Sub-task (was: Task) > Let sql/catalyst module compile for Scala 2.13 > -- > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: catalyst-failed-cases > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq > on these to ensure they still work on 2.12.
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170754#comment-17170754 ] Sean R. Owen commented on SPARK-32526: -- Yes, I know [~dongjoon] has also been working on tests and has up to core passing, so this would be next. > Let sql/catalyst module compile for Scala 2.13 > -- > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: catalyst-failed-cases > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292 , call .toSeq > on these to ensure they still work on 2.12. 
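The .toSeq fix referenced above can be illustrated with a small, self-contained sketch in plain Scala (no Spark classes; the method name and strings below are hypothetical stand-ins for the Analyzer/Optimizer signatures in the errors):

```scala
import scala.collection.mutable.ArrayBuffer

object ToSeqExample {
  // Stand-in for a method that requires a Seq, like the Analyzer and
  // Optimizer signatures in the compile errors above. On Scala 2.13,
  // scala.Seq is an alias for scala.collection.immutable.Seq, so a
  // mutable ArrayBuffer no longer conforms to it; on 2.12, scala.Seq
  // is collection.Seq and the buffer is accepted directly.
  def countPlans(plans: Seq[String]): Int = plans.length

  def main(args: Array[String]): Unit = {
    val buffer = ArrayBuffer("Project", "Filter", "Join")
    // countPlans(buffer)  // compiles on 2.12, fails to compile on 2.13
    // .toSeq compiles on both versions (on 2.13 it produces an
    // immutable copy; on 2.12 it simply returns the buffer as a Seq).
    println(countPlans(buffer.toSeq))
  }
}
```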
[jira] [Assigned] (SPARK-32524) SharedSparkSession should clean up InMemoryRelation.ser
[ https://issues.apache.org/jira/browse/SPARK-32524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32524: Assignee: Dongjoon Hyun > SharedSparkSession should clean up InMemoryRelation.ser > > > Key: SPARK-32524 > URL: https://issues.apache.org/jira/browse/SPARK-32524 > Project: Spark > Issue Type: Bug > Components: SQL, Tests > Affects Versions: 3.1.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Priority: Major
[jira] [Resolved] (SPARK-32524) SharedSparkSession should clean up InMemoryRelation.ser
[ https://issues.apache.org/jira/browse/SPARK-32524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32524. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29346 [https://github.com/apache/spark/pull/29346]
[jira] [Updated] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fakrul Razi updated SPARK-32527: Description: I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. By default, when we start the master and the slaves, ports 8080 and 8081 are opened automatically by the Spark application. Due to security constraints, I would like to disable the Spark web UI for the master (8080) and all workers (8081). was: I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. I want to disable the Spark web UI for the master (8080) and all workers (8081). > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core > Affects Versions: 2.3.3 > Reporter: Fakrul Razi > Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and the slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
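For context, a hedged sketch of the related configuration knobs (standalone mode; the port values below are arbitrary examples, not recommendations). Note that `spark.ui.enabled=false` disables only the per-application driver UI (port 4040), not the master/worker UIs, so fully blocking 8080/8081 may still require firewall rules:

```shell
# conf/spark-env.sh -- the standalone launch scripts read these
# environment variables to pick the master/worker web UI ports
# (defaults: 8080 for the master, 8081 for each worker).
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_WORKER_WEBUI_PORT=18081

# conf/spark-defaults.conf equivalent -- disables only the
# per-application driver UI on port 4040:
# spark.ui.enabled  false
```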
[jira] [Created] (SPARK-32527) How to disable port 8080 in Spark?
Fakrul Razi created SPARK-32527: --- Summary: How to disable port 8080 in Spark? Key: SPARK-32527 URL: https://issues.apache.org/jira/browse/SPARK-32527 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.3.3 Reporter: Fakrul Razi I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. I want to disable the Spark web UI for the master (8080) and all workers (8081).
[jira] [Comment Edited] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170640#comment-17170640 ] Yang Jie edited comment on SPARK-32526 at 8/4/20, 8:01 AM: --- There are 97 test cases FAILED and 3 test cases ABORTED in the sql/catalyst module with the scala-2.13 profile; the list of failures is in the attachment and needs to be fixed later. In addition, the encode/decode test in RowEncoderSuite is ignored because it generates a large number of error characters.
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170640#comment-17170640 ] Yang Jie commented on SPARK-32526: -- There are 97 test cases FAILED and 3 test cases ABORTED in the sql/catalyst module with the scala-2.13 profile; the list of failures is in the attachment and needs to be fixed later. In addition, the encode/decode test in RowEncoderSuite is ignored because it generates a large number of error characters.
[jira] [Updated] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-32526: - Attachment: catalyst-failed-cases
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170626#comment-17170626 ] Yang Jie commented on SPARK-32526: -- [~srowen] Can this issue be a sub-task of https://issues.apache.org/jira/browse/SPARK-25075?
[jira] [Assigned] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32526: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170625#comment-17170625 ] Apache Spark commented on SPARK-32526: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29348
[jira] [Assigned] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32526: Assignee: Apache Spark
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170624#comment-17170624 ]

Apache Spark commented on SPARK-32526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29348

> Let sql/catalyst module compile for Scala 2.13
> --
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yang Jie
> Priority: Minor
>
> The sql/catalyst module has the following compile errors with the scala-2.13 profile:
> {code:java}
> [ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq on these so that they still work with 2.12.
[jira] [Commented] (SPARK-32492) Fulfill missing column meta information for thrift server client tools
[ https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170619#comment-17170619 ]

Apache Spark commented on SPARK-32492:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29347

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: wx20200730-175...@2x.png
>
> Most fields of a column are missing, e.g. position, column-size
[jira] [Created] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
Yang Jie created SPARK-32526:

Summary: Let sql/catalyst module compile for Scala 2.13
Key: SPARK-32526
URL: https://issues.apache.org/jira/browse/SPARK-32526
Project: Spark
Issue Type: Task
Components: SQL
Affects Versions: 3.0.0
Reporter: Yang Jie

The sql/catalyst module has the following compile errors with the scala-2.13 profile:

{code:java}
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
 required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
[INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
[INFO] [Info] : false
[ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
 required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
[INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
[INFO] [Info] : false
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
 required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
[INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
[INFO] [Info] : false
[ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
 required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
{code}

Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq on these so that they still work with 2.12.
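The type-mismatch errors and the `.toSeq` remedy can be sketched outside of Spark. The snippet below is illustrative only (the `takesSeq` helper is a hypothetical stand-in for the `Analyzer`/`Optimizer` call sites, not code from the actual patch):

{code:java}
import scala.collection.mutable.ArrayBuffer

object ToSeqSketch {
  // On Scala 2.13 the default scala.Seq is an alias for
  // scala.collection.immutable.Seq, so a mutable ArrayBuffer is no longer
  // a Seq and passing one where Seq is required is a type mismatch.
  // On 2.12, scala.Seq is scala.collection.Seq and the call compiles as-is.
  def takesSeq(pairs: Seq[(Int, Int)]): Int = pairs.size

  def main(args: Array[String]): Unit = {
    val buf = ArrayBuffer((1, 2), (3, 4))
    // takesSeq(buf)           // compiles on 2.12 only; type mismatch on 2.13
    println(takesSeq(buf.toSeq)) // compiles on both; prints 2
  }
}
{code}

On 2.12 the `.toSeq` call is essentially free (the buffer is already a `Seq`), while on 2.13 it copies into an immutable `Seq`, which is why the same source then cross-compiles under both profiles.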
[jira] [Commented] (SPARK-32492) Fulfill missing column meta information for thrift server client tools
[ https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170618#comment-17170618 ]

Apache Spark commented on SPARK-32492:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29347

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: wx20200730-175...@2x.png
>
> Most fields of a column are missing, e.g. position, column-size