[jira] [Commented] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
[ https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171294#comment-17171294 ] Hyukjin Kwon commented on SPARK-32534: -- [~kvanlieshout] can you show the full steps to reproduce and the error messages you got? > Cannot load a Pipeline Model on a stopped Spark Context > --- > > Key: SPARK-32534 > URL: https://issues.apache.org/jira/browse/SPARK-32534 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.4.6 >Reporter: Kevin Van Lieshout >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > I am running Spark in a Kubernetes cluster that is running Spark NLP, using > the PySpark ML PipelineModel class to load the model and then transform the > Spark dataframe. We run this within a docker container that starts up a > spark context, mounts volumes, spins up executors, etc., then does its > transformations, udfs, etc., and then shuts down the spark context. The first > time I load the model when my service has just been started, everything is > fine. If I run my application a second time without restarting my service, > even though the context is entirely stopped from the previous run and a new > one is started up, the PipelineModel has some attribute in one of its base > classes that thinks the context it is running on is closed, so I get a > "cannot call a function on a stopped spark context" error when I try to load the > model in my service again. I have to shut down my service each time if I want > consecutive runs through my spark pipeline, which is not ideal, so I was > wondering if this is a common issue amongst fellow pyspark users that use > PipelineModel, whether there is a common workaround to reset all spark > contexts, or whether the pipeline model caches a spark context of some sort. > Any help is very useful. > > > cls.pipeline = PipelineModel.read().load(NLP_MODEL) > > is how I load the model. 
And our spark context is very similar to a typical > kubernetes/spark setup. Nothing special there -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
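The reporter's symptom (a second load failing even after a full context restart) is consistent with a context reference cached at class or module level that outlives `stop()`. A pure-Python sketch of that failure pattern; `FakeContext` and `ModelReader` are illustrative stand-ins, not Spark's actual `SparkContext` or `PipelineModel` classes:

```python
class FakeContext:
    """Stand-in for a SparkContext with a stop() lifecycle."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class ModelReader:
    """Stand-in for a model loader that caches its context at class level."""
    _cached_ctx = None  # survives across "service runs" - the suspect pattern

    @classmethod
    def load(cls, ctx):
        # Buggy pattern: the first context ever seen is cached and reused,
        # even after it has been stopped and a fresh one exists.
        if cls._cached_ctx is None:
            cls._cached_ctx = ctx
        if cls._cached_ctx.stopped:
            raise RuntimeError("Cannot call methods on a stopped SparkContext")
        return "model"


ctx1 = FakeContext()
ModelReader.load(ctx1)   # first run succeeds
ctx1.stop()              # service tears down the context

ctx2 = FakeContext()     # second run starts a brand-new context
# ModelReader.load(ctx2) would raise: the stale cached context is consulted,
# mirroring the error reported even though ctx2 itself is healthy.
```

If this is indeed the mechanism, the fix on the library side is to resolve the active context at load time rather than caching it.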
[jira] [Commented] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
[ https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171293#comment-17171293 ] Hyukjin Kwon commented on SPARK-32534: -- Please avoid setting Blocker+ for Priority, which is usually reserved for committers. > Cannot load a Pipeline Model on a stopped Spark Context > --- > > Key: SPARK-32534 > URL: https://issues.apache.org/jira/browse/SPARK-32534 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.4.6 >Reporter: Kevin Van Lieshout >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > I am running Spark in a Kubernetes cluster that is running Spark NLP, using > the PySpark ML PipelineModel class to load the model and then transform the > Spark dataframe. We run this within a docker container that starts up a > spark context, mounts volumes, spins up executors, etc., then does its > transformations, udfs, etc., and then shuts down the spark context. The first > time I load the model when my service has just been started, everything is > fine. If I run my application a second time without restarting my service, > even though the context is entirely stopped from the previous run and a new > one is started up, the PipelineModel has some attribute in one of its base > classes that thinks the context it is running on is closed, so I get a > "cannot call a function on a stopped spark context" error when I try to load the > model in my service again. I have to shut down my service each time if I want > consecutive runs through my spark pipeline, which is not ideal, so I was > wondering if this is a common issue amongst fellow pyspark users that use > PipelineModel, whether there is a common workaround to reset all spark > contexts, or whether the pipeline model caches a spark context of some sort. > Any help is very useful. > > > cls.pipeline = PipelineModel.read().load(NLP_MODEL) > > is how I load the model. 
And our spark context is very similar to a typical > kubernetes/spark setup. Nothing special there
[jira] [Updated] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
[ https://issues.apache.org/jira/browse/SPARK-32534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32534: - Priority: Minor (was: Blocker) > Cannot load a Pipeline Model on a stopped Spark Context > --- > > Key: SPARK-32534 > URL: https://issues.apache.org/jira/browse/SPARK-32534 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.4.6 >Reporter: Kevin Van Lieshout >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > I am running Spark in a Kubernetes cluster that is running Spark NLP, using > the PySpark ML PipelineModel class to load the model and then transform the > Spark dataframe. We run this within a docker container that starts up a > spark context, mounts volumes, spins up executors, etc., then does its > transformations, udfs, etc., and then shuts down the spark context. The first > time I load the model when my service has just been started, everything is > fine. If I run my application a second time without restarting my service, > even though the context is entirely stopped from the previous run and a new > one is started up, the PipelineModel has some attribute in one of its base > classes that thinks the context it is running on is closed, so I get a > "cannot call a function on a stopped spark context" error when I try to load the > model in my service again. I have to shut down my service each time if I want > consecutive runs through my spark pipeline, which is not ideal, so I was > wondering if this is a common issue amongst fellow pyspark users that use > PipelineModel, whether there is a common workaround to reset all spark > contexts, or whether the pipeline model caches a spark context of some sort. > Any help is very useful. > > > cls.pipeline = PipelineModel.read().load(NLP_MODEL) > > is how I load the model. And our spark context is very similar to a typical > kubernetes/spark setup. 
Nothing special there
[jira] [Updated] (SPARK-32535) Query with broadcast hints fail when query has a WITH clause
[ https://issues.apache.org/jira/browse/SPARK-32535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32535: - Component/s: (was: Spark Core) SQL > Query with broadcast hints fail when query has a WITH clause > > > Key: SPARK-32535 > URL: https://issues.apache.org/jira/browse/SPARK-32535 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Arvind Krishnan >Priority: Major > > If a query has a WITH clause and a query hint (like `BROADCAST`), the query > fails. > In the code sample below, executing `sql2` fails, but `sql1` passes. > {code:java} > import spark.implicits._ > val df = List( > ("1", "B", "C"), > ("A", "2", "C"), > ("A", "B", "3") > ).toDF("COL_A", "COL_B", "COL_C") > df.createOrReplaceTempView("table1") > val df1 = List( > ("A", "2", "3"), > ("1", "B", "3"), > ("1", "2", "C") > ).toDF("COL_A", "COL_B", "COL_C") > df1.createOrReplaceTempView("table2") > val sql1 = "select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join > table2 b on a.COL_A = b.COL_A" > val sql2 = "with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner > join table2 b on a.COL_A = b.COL_A) select X.COL_A from X" > val df2 = spark.sql(sql2) > println(s"Row Count ${df2.count()}") > println("Rows... ") > df2.show(false) > {code} > > I tried executing this sample program with Spark 2.4.0, and both SQL > statements work.
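The structural difference between the passing and failing statements is where the hint sits: in `sql2` the `/*+ BROADCAST(a) */` hint is scoped inside the CTE body rather than the top-level query block. A small, hypothetical string-level check (not a Spark API) that distinguishes the two shapes from the report:

```python
import re

# The two SQL strings from the bug report, verbatim.
SQL1 = ("select /*+ BROADCAST(a) */ a.COL_A from table1 a "
        "inner join table2 b on a.COL_A = b.COL_A")
SQL2 = ("with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner "
        "join table2 b on a.COL_A = b.COL_A) select X.COL_A from X")


def hint_inside_cte(sql):
    """Return True when a /*+ ... */ hint appears inside a WITH-clause body."""
    m = re.search(r"with\s+\w+\s+as\s*\((.*)\)", sql,
                  re.IGNORECASE | re.DOTALL)
    return bool(m and re.search(r"/\*\+.*?\*/", m.group(1)))


hint_inside_cte(SQL1)  # False: hint is in the top-level query block (passes)
hint_inside_cte(SQL2)  # True: hint is scoped to the CTE body (the failing case)
```

This is only a diagnostic aid for spotting the failing shape; it says nothing about how Spark 3.0 actually resolves hints inside CTEs.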
[jira] [Updated] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Description: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. 
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) ... 8 more Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception when loading 1 in table id_name2 with loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; {code} it seems that Spark doesn't check whether the partition's HDFS location exists before deleting it, while Hive can successfully execute the same SQL. 
was: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.
[jira] [Updated] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition (was: deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition) > deleted not existing hdfs locations when use spark sql to execute "insert > overwrite" statement to dynamic partition > --- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
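The guard the reporter suggests is missing can be sketched in plain Python: check that a partition location exists before attempting cleanup, which is the behavior Hive exhibits for the same statement. `FakeFs` and `cleanup_old_partition` below are illustrative stand-ins, not Hadoop FileSystem or Hive Metastore APIs:

```python
def cleanup_old_partition(fs, path):
    """Delete `path` only if it exists; return whether anything was removed.

    Skipping a missing location avoids the FileNotFoundException seen in
    the log above when the old partition directory never existed.
    """
    if not fs.exists(path):
        return False
    fs.delete(path, recursive=True)
    return True


class FakeFs:
    """Tiny in-memory stand-in for a distributed filesystem client."""
    def __init__(self, paths):
        self.paths = set(paths)

    def exists(self, path):
        return path in self.paths

    def delete(self, path, recursive=False):
        self.paths.discard(path)


fs = FakeFs({"/warehouse/tmp.db/id_name2/dt=2000"})
cleanup_old_partition(fs, "/warehouse/tmp.db/id_name2/dt=2001")  # False: missing, skipped
cleanup_old_partition(fs, "/warehouse/tmp.db/id_name2/dt=2000")  # True: existed, removed
```

Whether Spark should add such a check, or tolerate the exception the way Hive does, is exactly the question the issue raises.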
[jira] [Updated] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition (was: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition) > deleted not existing hdfs locations when use spark sql to execute "insert > overwrite" statement with dynamic partition > - > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
[jira] [Updated] (SPARK-32536) deleted not existin hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" statement with dynamic partition (was: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement) > deleted not existin hdfs locations when use spark sql to execute "insert > overwrite" statement with dynamic partition > > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
[jira] [Updated] (SPARK-32536) deleted not existin hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existin hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement (was: deleted not existing partition hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement) > deleted not existin hdfs locations when use spark sql to execute "insert > overwrite" dynamic partition statement > --- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it.
[jira] [Updated] (SPARK-32536) deleted not existing partition hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: deleted not existing partition hdfs locations when use spark sql to execute "insert overwrite" dynamic partition statement (was: spark sql insert overwrite dynamic partition deleted not existing partition hdfs locations) > deleted not existing partition hdfs locations when use spark sql to execute > "insert overwrite" dynamic partition statement > -- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
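The guard the reporter suggests amounts to checking for the partition directory before attempting cleanup. A hypothetical sketch of that idea, with the local filesystem standing in for HDFS (the real fix would live in Hive's deleteOldPathForReplace/cleanUpOneDirectoryForReplace path; this is not the actual patch):

```python
import os
import shutil

def clean_up_dir_for_replace(path: str) -> None:
    """Delete an old partition directory, tolerating its absence.

    Illustrative only: the local filesystem stands in for HDFS. The
    existence check avoids the FileNotFoundException in the report when
    the partition directory was never created.
    """
    if not os.path.isdir(path):
        # Nothing to clean up -- the INSERT OVERWRITE targets a partition
        # whose directory does not exist yet.
        return
    shutil.rmtree(path)

# Demo: cleaning a missing directory is a no-op instead of an error.
os.makedirs("id_name2/dt=2001", exist_ok=True)
clean_up_dir_for_replace("id_name2/dt=2001")   # removes the directory
clean_up_dir_for_replace("id_name2/dt=2001")   # second call: safe no-op
```

The second call mirrors the failing case in the log: the directory is already gone, but cleanup returns quietly rather than raising.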
[jira] [Updated] (SPARK-32536) spark sql insert overwrite dynamic partition deleted not existing partition hdfs locations
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Summary: spark sql insert overwrite dynamic partition deleted not existing partition hdfs locations (was: spark sql insert overwrite dynamic partition error) > spark sql insert overwrite dynamic partition deleted not existing partition > hdfs locations > -- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} > it seems that Spark doesn't check whether the partition's HDFS location > exists before deleting it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32536) spark sql insert overwrite dynamic partition error
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Description: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. 
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) ... 8 more Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception when loading 1 in table id_name2 with loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; {code} it seems that Spark doesn't check whether the partition's HDFS location exists before deleting it. 
was: when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at o
[jira] [Updated] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31851: - Description: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an example. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ - https://pandas.pydata.org/docs/ was: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an exmaple. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ - https://pandas.pydata.org/docs/ > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > example. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. 
> Reference: > - https://koalas.readthedocs.io/en/latest/ > - https://pandas.pydata.org/docs/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31851: - Description: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an exmaple. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ - https://pandas.pydata.org/docs/ was: Currently, PySpark documentation (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much poorly written compared to other projects. See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an exmaple. PySpark is being more and more important in Spark, and we should improve this documentation so people can easily follow. Reference: - https://koalas.readthedocs.io/en/latest/ > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ > - https://pandas.pydata.org/docs/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32536) spark sql insert overwrite dynamic partition error
[ https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yx91490 updated SPARK-32536: Attachment: SPARK-32536.full.log > spark sql insert overwrite dynamic partition error > -- > > Key: SPARK-32536 > URL: https://issues.apache.org/jira/browse/SPARK-32536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 > Environment: HDP version 2.3.2.3.1.4.0-315 >Reporter: yx91490 >Priority: Major > Attachments: SPARK-32536.full.log > > > when execute insert overwrite table statement to dynamic partition : > > {code:java} > set hive.exec.dynamic.partition=true; > set hive.exec.dynamic.partition.mode=nostrict; > insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name > where dt='2001'; > {code} > output log: > {code:java} > 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with > parameters > partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, > table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, > listBucketingLevel=0, isAcid=false, resetStatistics=false > org.apache.hadoop.hive.ql.metadata.HiveException: Directory > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be > cleaned up. 
> at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) > at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) > at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) > at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.io.FileNotFoundException: File > hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) > at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) > at > org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) > at > org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) > ... 
8 more > Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception > when loading 1 in table id_name2 with > loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32536) spark sql insert overwrite dynamic partition error
yx91490 created SPARK-32536: --- Summary: spark sql insert overwrite dynamic partition error Key: SPARK-32536 URL: https://issues.apache.org/jira/browse/SPARK-32536 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Environment: HDP version 2.3.2.3.1.4.0-315 Reporter: yx91490 when execute insert overwrite table statement to dynamic partition : {code:java} set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nostrict; insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name where dt='2001'; {code} output log: {code:java} 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with parameters partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001, table=id_name2, partSpec={dt=2001}, loadFileType=REPLACE_ALL, listBucketingLevel=0, isAcid=false, resetStatistics=false org.apache.hadoop.hive.ql.metadata.HiveException: Directory hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be cleaned up. at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666) at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588) at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.FileNotFoundException: File hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist. 
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053) at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113) at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910) at org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681) at org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661) ... 8 more Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception when loading 1 in table id_name2 with loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1; {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171284#comment-17171284 ] Hyukjin Kwon commented on SPARK-32187: -- [~fhoering], I made one example at SPARK-32507 to refer to. Please also see https://issues.apache.org/jira/browse/SPARK-31851?focusedCommentId=17171275&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17171275 Would you be able to start writing up a page about PEX? If you're not used to shipping Python packages with zipped files or .py files, you can write it about PEX only for now. I can file a separate JIRA for that if that's better for you. > User Guide - Shipping Python Package > > > Key: SPARK-32187 > URL: https://issues.apache.org/jira/browse/SPARK-32187 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > - Zipped file > - Python files > - PEX \(?\) (see also SPARK-25433) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
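For the "Zipped file" bullet in that ticket, the mechanics can be sketched without Spark itself: build a .zip whose entries keep the package's top-level name, then ship it with `spark-submit --py-files` or `SparkContext.addPyFile`. The names below (`mypkg`, `deps.zip`) are illustrative only, not from the ticket:

```python
import os
import zipfile

def zip_package(pkg_dir: str, out_zip: str) -> str:
    """Zip a package directory so `import <pkg>` works on executors.

    Entries are stored relative to the package's parent directory, so the
    archive contains e.g. mypkg/__init__.py rather than bare file names.
    """
    root = os.path.dirname(os.path.abspath(pkg_dir))
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _, filenames in os.walk(pkg_dir):
            for name in filenames:
                full = os.path.join(dirpath, name)
                zf.write(full, os.path.relpath(full, root))
    return out_zip

# Build a throwaway package and bundle it.
os.makedirs("mypkg", exist_ok=True)
with open(os.path.join("mypkg", "__init__.py"), "w") as f:
    f.write("VERSION = '0.1'\n")
zip_package("mypkg", "deps.zip")
# Then, hypothetically:
#   spark-submit --py-files deps.zip app.py
# or at runtime: spark.sparkContext.addPyFile("deps.zip")
```

The same archive layout is what the eventual user-guide page would need to describe for the zipped-file and plain .py cases; PEX would replace the zip step with a pex build.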
[jira] [Comment Edited] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon edited comment on SPARK-31851 at 8/5/20, 6:24 AM: --- SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: {{cd python/docs && make clean html}} and open {{python/docs/build/html/index.html}}. - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 As an example, if you're adding a page under "User Guide", you might have to do: 1. go to the source: {code} cd python/docs/source/user_guide {code} 2. Write up a page you want. Let's suppose you wrote `shipping_pagkages.rst`. 3. Put it in {{python/docs/source/user_guide/index.rst}}: {code} ... == User Guide == .. toctree:: :maxdepth: 2 shipping_pagkages ... {code} was (Author: hyukjin.kwon): SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 As an example, if you're adding a page under "User Guide", you might have to do: 1. go to the source: {code} cd python/docs/source/user_guide {code} 2. Write up a page you want. Let's suppose you wrote `shipping_pagkages.rst`. 3. Put it in {{python/docs/source/user_guide/index.rst}}: {code} ... == User Guide == .. 
toctree:: :maxdepth: 2 shipping_pagkages ... {code} > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon edited comment on SPARK-31851 at 8/5/20, 6:23 AM: --- SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 As an example, if you're adding a page under "User Guide", you might have to do: 1. go to the source: {code} cd python/docs/source/user_guide {code} 2. Write up a page you want. Let's suppose you wrote `shipping_pagkages.rst`. 3. Put it in {{python/docs/source/user_guide/index.rst}}: {code} ... == User Guide == .. toctree:: :maxdepth: 2 shipping_pagkages ... {code} was (Author: hyukjin.kwon): SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! 
Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon edited comment on SPARK-31851 at 8/5/20, 6:14 AM: --- SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! Useful links: http://spark.apache.org/contributing.html Build the doc: - Official way: https://github.com/apache/spark/tree/master/docs - Unofficial way: cd python/docs && make clean html - See also dependencies needed to build the doc https://github.com/apache/spark/blob/master/.github/workflows/master.yml#L230 was (Author: hyukjin.kwon): SPARK-32507 was merged. People should be able to refer this as an example. If you guys are interested in taking some of sub-tasks here, please feel free to go ahead! > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > exmaple. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31851) Redesign PySpark documentation
[ https://issues.apache.org/jira/browse/SPARK-31851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171275#comment-17171275 ] Hyukjin Kwon commented on SPARK-31851: -- SPARK-32507 was merged. People should be able to refer to this as an example. If you guys are interested in taking some of the sub-tasks here, please feel free to go ahead! > Redesign PySpark documentation > -- > > Key: SPARK-31851 > URL: https://issues.apache.org/jira/browse/SPARK-31851 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark, Spark Core, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > > Currently, PySpark documentation > (https://spark.apache.org/docs/latest/api/python/index.html) is pretty much > poorly written compared to other projects. > See, for example, see Koalas https://koalas.readthedocs.io/en/latest/ as an > example. > PySpark is being more and more important in Spark, and we should improve this > documentation so people can easily follow. > Reference: > - https://koalas.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32528: Description: In tests, we usually call `plan.analyze` to get the analyzed plan and test analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan is resolved, which may surprise the test writers. > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > > In tests, we usually call `plan.analyze` to get the analyzed plan and test > analyzer/optimizer rules. However, `plan.analyze` doesn't guarantee the plan > is resolved, which may surprise the test writers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32507) Main Page
[ https://issues.apache.org/jira/browse/SPARK-32507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32507. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29320 [https://github.com/apache/spark/pull/29320] > Main Page > - > > Key: SPARK-32507 > URL: https://issues.apache.org/jira/browse/SPARK-32507 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > We should make a main package to overview PySpark properly. See the demo > example: > https://hyukjin-spark.readthedocs.io/en/latest/ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32535) Query with broadcast hints fail when query has a WITH clause
Arvind Krishnan created SPARK-32535: --- Summary: Query with broadcast hints fail when query has a WITH clause Key: SPARK-32535 URL: https://issues.apache.org/jira/browse/SPARK-32535 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Arvind Krishnan If a query has a WITH clause and a query hint (like `BROADCAST`), the query fails. In the code sample below, executing `sql2` fails, but `sql1` passes. {code:java} import spark.implicits._ val df = List( ("1", "B", "C"), ("A", "2", "C"), ("A", "B", "3") ).toDF("COL_A", "COL_B", "COL_C") df.createOrReplaceTempView("table1") val df1 = List( ("A", "2", "3"), ("1", "B", "3"), ("1", "2", "C") ).toDF("COL_A", "COL_B", "COL_C") df1.createOrReplaceTempView("table2") val sql1 = "select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join table2 b on a.COL_A = b.COL_A" val sql2 = "with X as (select /*+ BROADCAST(a) */ a.COL_A from table1 a inner join table2 b on a.COL_A = b.COL_A) select X.COL_A from X" val df2 = spark.sql(sql2) println(s"Row Count ${df2.count()}") println("Rows... ") df2.show(false) {code} I tried executing this sample program with Spark 2.4.0, and both SQL statements work.
[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1
[ https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171197#comment-17171197 ] L. C. Hsieh commented on SPARK-32502: - I did some testings in the PRs. Few changes are required to pass the failed Hive tests: # Shading Guava at hive-exec packaging and a few code changes to hive-common and hive-exec regarding Guava usage # Don't use core classifier for hive dependencies in Spark But this just upgrades Guava version used in Spark. Hive dependencies still use older Guava with the reported CVE. > Please fix CVE related to Guava 14.0.1 > -- > > Key: SPARK-32502 > URL: https://issues.apache.org/jira/browse/SPARK-32502 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Rodney Aaron Stainback >Priority: Major > > Please fix the following CVE related to Guava 14.0.1 > |cve|severity|cvss| > |CVE-2018-10237|medium|5.9| > > Our security team is trying to block us from using spark because of this issue > > One thing that's very weird is I see from this [pom > file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]] > you reference guava but it's not clear what version. > > But if I look on > [maven|[https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]] > the guava reference is not showing up > > Is this reference somehow being shaded into the network common jar? It's not > clear to me. > > Also, I've noticed code like [this > file|[https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]] > which is a copy-paste of some guava source code. > > The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute > Edition is very thorough and will find CVEs in copy-pasted code and shaded > jars. 
> > Please fix this CVE so we can use spark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
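"Shading" in the comment above means relocating Guava's packages at build time so the packaged jar does not expose the vulnerable classes under their original `com.google.common` names. The fragment below is only an illustrative maven-shade-plugin configuration, not the actual Spark or Hive build change discussed in the PRs; the relocated package name is an assumption.

```xml
<!-- Illustrative maven-shade-plugin fragment (relocated package name is an
     assumption): rewrites Guava class references into a private namespace
     so the original com.google.common classes are not exposed. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.spark_project.guava</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Note that, as the comment points out, relocation only hides the classes in the shaded artifact; dependencies that still pull in an old Guava keep the reported CVE.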
[jira] [Commented] (SPARK-31419) Document Table-valued Function and Inline Table
[ https://issues.apache.org/jira/browse/SPARK-31419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171187#comment-17171187 ] Apache Spark commented on SPARK-31419: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/29355 > Document Table-valued Function and Inline Table > --- > > Key: SPARK-31419 > URL: https://issues.apache.org/jira/browse/SPARK-31419 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Document Table-valued Function and Inline Table -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171184#comment-17171184 ] Apache Spark commented on SPARK-32533: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29354 > Improve Avro read/write performance on nested structs and array of structs > -- > > Key: SPARK-32533 > URL: https://issues.apache.org/jira/browse/SPARK-32533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for Avro file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 75 -> 46 > Array of Struct: 47 -> 17 > Write > Nested Structs: 147 -> 36 > Array of Struct: 139 -> 34 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32533: Assignee: Apache Spark > Improve Avro read/write performance on nested structs and array of structs > -- > > Key: SPARK-32533 > URL: https://issues.apache.org/jira/browse/SPARK-32533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Have some improvements for Avro file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 75 -> 46 > Array of Struct: 47 -> 17 > Write > Nested Structs: 147 -> 36 > Array of Struct: 139 -> 34 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32533: Assignee: (was: Apache Spark) > Improve Avro read/write performance on nested structs and array of structs > -- > > Key: SPARK-32533 > URL: https://issues.apache.org/jira/browse/SPARK-32533 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for Avro file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 75 -> 46 > Array of Struct: 47 -> 17 > Write > Nested Structs: 147 -> 36 > Array of Struct: 139 -> 34 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32534) Cannot load a Pipeline Model on a stopped Spark Context
Kevin Van Lieshout created SPARK-32534: -- Summary: Cannot load a Pipeline Model on a stopped Spark Context Key: SPARK-32534 URL: https://issues.apache.org/jira/browse/SPARK-32534 Project: Spark Issue Type: Bug Components: Deploy, Kubernetes Affects Versions: 2.4.6 Reporter: Kevin Van Lieshout I am running Spark in a Kubernetes cluster that runs Spark NLP, using the PySpark ML PipelineModel class to load the model and then transform the Spark DataFrame. We run this within a docker container that starts up a Spark context, mounts volumes, spins up executors, etc., performs its transformations, UDFs, etc., and then shuts down the Spark context. The first time I load the model, when my service has just been started, everything is fine. If I run my application a second time without restarting my service, even though the previous context has been entirely stopped and a new one started up, the PipelineModel has some attribute in one of its base classes that still thinks the context it is running on is closed, so I get a "cannot call a function on a stopped spark context" error when I try to load the model in my service again. I have to shut down my service each time if I want consecutive runs through my Spark pipeline, which is not ideal. I was wondering whether this is a common issue among fellow PySpark users of PipelineModel, whether there is a common workaround for resetting all Spark contexts, or whether the pipeline model caches a Spark context of some sort. Any help is very useful. cls.pipeline = PipelineModel.read().load(NLP_MODEL) is how I load the model. And our Spark context setup is very similar to a typical Kubernetes/Spark setup. Nothing special there.
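The symptom described above matches a common pattern: a handle to the first context is cached at class or module level and is never refreshed, so later loads keep calling into the stopped context even though a fresh one exists. The following is a Spark-free sketch of that pattern and of the usual workaround (clearing the cached reference before reloading); `Context` and `ModelLoader` are hypothetical names, not PySpark classes.

```python
# Spark-free sketch of the failure pattern: a loader that caches a context
# handle at class level keeps using the stopped one even after a fresh
# context is started.  Names (Context, ModelLoader) are hypothetical.

class Context:
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True
    def call(self):
        if self.stopped:
            raise RuntimeError("Cannot call methods on a stopped context")
        return "ok"

class ModelLoader:
    _ctx = None                     # cached once, never refreshed: the bug
    @classmethod
    def load(cls, ctx):
        if cls._ctx is None:
            cls._ctx = ctx
        return cls._ctx.call()

ctx1 = Context()
ModelLoader.load(ctx1)              # first run: fine
ctx1.stop()

ctx2 = Context()
try:
    ModelLoader.load(ctx2)          # second run still hits the stale ctx1
except RuntimeError:
    pass

ModelLoader._ctx = None             # workaround: drop the cached handle
assert ModelLoader.load(ctx2) == "ok"
```

If PipelineModel (or a library underneath it, such as Spark NLP) caches state like this, resetting that cached reference between runs — rather than restarting the whole service — may be enough; which attribute to clear depends on the actual implementation and is not confirmed here.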
[jira] [Assigned] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32532: Assignee: (was: Apache Spark) > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171118#comment-17171118 ] Apache Spark commented on SPARK-32532: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29353 > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32532: Assignee: Apache Spark > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32531: Description: We had found that Spark performance was slow as compared to PIG on some schemas in our pipelines. On investigation, it was found that Spark performance was slow for nested structs and array'd structs and these cases were not being profiled by the current benchmarks. I have some improvements for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the performance in these cases and will be putting up the PRs soon. (was: Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon.) > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > We had found that Spark performance was slow as compared to PIG on some > schemas in our pipelines. On investigation, it was found that Spark > performance was slow for nested structs and array'd structs and these cases > were not being profiled by the current benchmarks. I have some improvements > for ORC (SPARK-32532) and Avro (SPARK-32533) file formats which improve the > performance in these cases and will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171108#comment-17171108 ] Apache Spark commented on SPARK-32531: -- User 'msamirkhan' has created a pull request for this issue: https://github.com/apache/spark/pull/29352 > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32531: Assignee: Apache Spark > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Assignee: Apache Spark >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32531: Assignee: (was: Apache Spark) > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32531) Add benchmarks for nested structs and arrays for different file formats
[ https://issues.apache.org/jira/browse/SPARK-32531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32531: Summary: Add benchmarks for nested structs and arrays for different file formats (was: Add benchmarks for nested structs and arrays for different data types) > Add benchmarks for nested structs and arrays for different file formats > --- > > Key: SPARK-32531 > URL: https://issues.apache.org/jira/browse/SPARK-32531 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Additions to benchmarks for different file formats for nested structs and > arrays which are not being currently benchmarked. I have some improvements > for ORC and Avro file formats which improve the performance in these cases. > I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32532: Description: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. was: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was > able to improve performance on branch-3.0 as follows (measurements in > seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32533) Improve Avro read/write performance on nested structs and array of structs
Muhammad Samir Khan created SPARK-32533: --- Summary: Improve Avro read/write performance on nested structs and array of structs Key: SPARK-32533 URL: https://issues.apache.org/jira/browse/SPARK-32533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Have some improvements for Avro file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in SPARK-32531 was able to improve performance on branch-3.0 as follows (measurements in seconds): Read: Nested Structs: 75 -> 46 Array of Struct: 47 -> 17 Write Nested Structs: 147 -> 36 Array of Struct: 139 -> 34 Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
[ https://issues.apache.org/jira/browse/SPARK-32532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Muhammad Samir Khan updated SPARK-32532: Description: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. was: Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. > Improve ORC read/write performance on nested structs and array of structs > - > > Key: SPARK-32532 > URL: https://issues.apache.org/jira/browse/SPARK-32532 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Muhammad Samir Khan >Priority: Major > > Have some improvements for ORC file format to reduce time taken when > reading/writing nested/array'd structs. Using benchmarks in [SPARK-32531] was > able to improve performance as follows (measurements in seconds): > Read: > Nested Structs: 184 -> 44 > Array of Struct: 66 -> 15 > Write > Nested Structs: 543 -> 39 > Array of Struct: 330 -> 37 > Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32532) Improve ORC read/write performance on nested structs and array of structs
Muhammad Samir Khan created SPARK-32532: --- Summary: Improve ORC read/write performance on nested structs and array of structs Key: SPARK-32532 URL: https://issues.apache.org/jira/browse/SPARK-32532 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Have some improvements for ORC file format to reduce time taken when reading/writing nested/array'd structs. Using benchmarks in [SPARK-32071] was able to improve performance as follows (measurements in seconds): Read: Nested Structs: 184 -> 44 Array of Struct: 66 -> 15 Write Nested Structs: 543 -> 39 Array of Struct: 330 -> 37 Will be putting up the PR soon with the changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32531) Add benchmarks for nested structs and arrays for different data types
Muhammad Samir Khan created SPARK-32531: --- Summary: Add benchmarks for nested structs and arrays for different data types Key: SPARK-32531 URL: https://issues.apache.org/jira/browse/SPARK-32531 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: Muhammad Samir Khan Additions to benchmarks for different file formats for nested structs and arrays which are not being currently benchmarked. I have some improvements for ORC and Avro file formats which improve the performance in these cases. I will be putting up the PRs soon. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32530) SPIP: Kotlin support for Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-32530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pasha Finkeshteyn updated SPARK-32530: -- Issue Type: Improvement (was: Bug) > SPIP: Kotlin support for Apache Spark > - > > Key: SPARK-32530 > URL: https://issues.apache.org/jira/browse/SPARK-32530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Pasha Finkeshteyn >Priority: Major > > h2. Background and motivation > Kotlin is a cross-platform, statically typed, general-purpose JVM language. > In the last year more than 5 million developers have used Kotlin in mobile, > backend, frontend and scientific development. The number of Kotlin developers > grows rapidly every year. > * [According to > redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: > "Kotlin, the second fastest growing language we’ve seen outside of Swift, > made a big splash a year ago at this time when it vaulted eight full spots up > the list." > * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], > Kotlin is the second most popular language on the JVM > * [According to > StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share > increased by 7.8% in 2020. > We notice the increasing usage of Kotlin in data analysis ([6% of users in > 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to > 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in > 2019), and we expect these numbers to continue to grow. > We, authors of this SPIP, strongly believe that making Kotlin API officially > available to developers can bring new users to Apache Spark and help some of > the existing users. > h2. Goals > The goal of this project is to bring first-class support for Kotlin language > into the Apache Spark project. We’re going to achieve this by adding one more > module to the current Apache Spark distribution. > h2. 
Non-goals > There is no goal to replace any existing language support or to change any > existing Apache Spark API. > At this time, there is no goal to support non-core APIs of Apache Spark like > Spark ML and Spark structured streaming. This may change in the future based > on community feedback. > There is no goal to provide a CLI for Kotlin for Apache Spark; that will be a > separate SPIP. > There is no goal to provide support for Apache Spark < 3.0.0. > h2. Current implementation > A working prototype is available at > [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside > JetBrains and by early adopters. > h2. What are the risks? > There is always a risk that this product won’t get enough popularity and will > bring more costs than benefits. It can be mitigated by the fact that we don't > need to change any existing API, and support can potentially be dropped at any > time. > We also believe that the existing API is rather low-maintenance. It does not > bring anything more complex than already exists in the Spark codebase. > Furthermore, the implementation is compact - less than 2000 lines of code. > We are committed to maintaining, improving and evolving the API based on > feedback from both the Spark and Kotlin communities. As the Kotlin data community > continues to grow, we see the Kotlin API for Apache Spark as an important part of > the evolving Kotlin ecosystem, and intend to fully support it. > h2. How long will it take? > A working implementation is already available, and if the community > has any proposals for improving this implementation, they > can be implemented quickly, in weeks if not days. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32530) SPIP: Kotlin support for Apache Spark
Pasha Finkeshteyn created SPARK-32530: - Summary: SPIP: Kotlin support for Apache Spark Key: SPARK-32530 URL: https://issues.apache.org/jira/browse/SPARK-32530 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Pasha Finkeshteyn h2. Background and motivation Kotlin is a cross-platform, statically typed, general-purpose JVM language. In the last year more than 5 million developers have used Kotlin in mobile, backend, frontend and scientific development. The number of Kotlin developers grows rapidly every year. * [According to redmonk|https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/]: "Kotlin, the second fastest growing language we’ve seen outside of Swift, made a big splash a year ago at this time when it vaulted eight full spots up the list." * [According to snyk.io|https://snyk.io/wp-content/uploads/jvm_2020.pdf], Kotlin is the second most popular language on the JVM * [According to StackOverflow|https://insights.stackoverflow.com/survey/2020] Kotlin’s share increased by 7.8% in 2020. We notice the increasing usage of Kotlin in data analysis ([6% of users in 2020|https://www.jetbrains.com/lp/devecosystem-2020/kotlin/], as opposed to 2% in 2019) and machine learning (3% of users in 2020, as opposed to 0% in 2019), and we expect these numbers to continue to grow. We, authors of this SPIP, strongly believe that making Kotlin API officially available to developers can bring new users to Apache Spark and help some of the existing users. h2. Goals The goal of this project is to bring first-class support for Kotlin language into the Apache Spark project. We’re going to achieve this by adding one more module to the current Apache Spark distribution. h2. Non-goals There is no goal to replace any existing language support or to change any existing Apache Spark API. At this time, there is no goal to support non-core APIs of Apache Spark like Spark ML and Spark structured streaming. 
This may change in the future based on community feedback. There is no goal to provide a CLI for Kotlin for Apache Spark; that will be a separate SPIP. There is no goal to provide support for Apache Spark < 3.0.0. h2. Current implementation A working prototype is available at [https://github.com/JetBrains/kotlin-spark-api]. It has been tested inside JetBrains and by early adopters. h2. What are the risks? There is always a risk that this product won’t get enough popularity and will bring more costs than benefits. It can be mitigated by the fact that we don't need to change any existing API, and support can potentially be dropped at any time. We also believe that the existing API is rather low-maintenance. It does not bring anything more complex than already exists in the Spark codebase. Furthermore, the implementation is compact - less than 2000 lines of code. We are committed to maintaining, improving and evolving the API based on feedback from both the Spark and Kotlin communities. As the Kotlin data community continues to grow, we see the Kotlin API for Apache Spark as an important part of the evolving Kotlin ecosystem, and intend to fully support it. h2. How long will it take? A working implementation is already available, and if the community has any proposals for improving this implementation, they can be implemented quickly, in weeks if not days.
[jira] [Updated] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-32003: - Fix Version/s: 2.4.7 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. 
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171083#comment-17171083 ] Imran Rashid commented on SPARK-32003: -- Fixed in 2.4.7 by https://github.com/apache/spark/pull/29182 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. 
So they used up 4 > stage attempts and the stage failed and thus the job.
[jira] [Commented] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171080#comment-17171080 ] Sean R. Owen commented on SPARK-32527: -- Yep, SO is a better place for questions. That said I think you just want spark.ui.enabled=false > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Minor > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
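For reference, a minimal sketch of where Sean's suggested property would go; note that spark.ui.enabled governs the per-application driver UI (port 4040 by default), while 8080 and 8081 belong to the standalone master and worker daemons:

```properties
# conf/spark-defaults.conf: disable the per-application web UI
spark.ui.enabled false
```

For the daemon UIs themselves, the standalone-mode docs expose port overrides (SPARK_MASTER_WEBUI_PORT and SPARK_WORKER_WEBUI_PORT in conf/spark-env.sh); to my knowledge there is no supported switch that disables the master/worker UIs outright in 2.3.3, so firewalling those ports may be the practical option.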
[jira] [Updated] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra updated SPARK-32527: - Priority: Minor (was: Critical) > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Minor > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Resolved] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Mishra resolved SPARK-32527. -- Resolution: Not A Problem > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Commented] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171069#comment-17171069 ] Rohit Mishra commented on SPARK-32527: -- [~shinudin], * Please use Stack Overflow for any questions. Kindly check this document to understand the best practices - http://spark.apache.org/community.html * Please don't mark the priority as Critical; those levels are mostly used by committers. * This issue will be marked resolved for now. Thanks. > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Comment Edited] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171069#comment-17171069 ] Rohit Mishra edited comment on SPARK-32527 at 8/4/20, 7:16 PM: --- [~shinudin], * Please use stack overflow for any question or use User mail list. Kindly check this document to understand the best practices - [http://spark.apache.org/community.html] * Please don't mark priority as critical, these are mostly used by committers. * This issue will be marked resolved for now. Thanks. was (Author: rohitmishr1484): [~shinudin], * Please use stack overflow for any question. Kindly check this document to understand the best practices - http://spark.apache.org/community.html * Please don't mark priority as critical, these are mostly used by committers. * This issue will be marked resolved for now. Thanks. > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Fakrul Razi >Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171062#comment-17171062 ] Rohit Mishra commented on SPARK-32528: -- [~cloud_fan], Can you please populate the description field? > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor >
[jira] [Commented] (SPARK-32427) Omit USING in CREATE TABLE via JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171020#comment-17171020 ] Huaxin Gao commented on SPARK-32427: [~maxgekk] I made tableProvider optional so USING can be omitted, but now it seems the CREATE TABLE syntax is ambiguous. For example, the following CREATE TABLE is supposed to create a Hive table, but after my change it creates a data source table. {code:java} s"""CREATE TABLE t1 ( | c1 INT COMMENT 'bla', | c2 STRING |) |TBLPROPERTIES ( | 'prop1' = 'value1', | 'prop2' = 'value2' |) """.stripMargin {code} > Omit USING in CREATE TABLE via JDBC Table Catalog > - > > Key: SPARK-32427 > URL: https://issues.apache.org/jira/browse/SPARK-32427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Support creating tables in JDBC Table Catalog without USING, for instance: > {code:sql} > CREATE TABLE h2.test.new_table(i INT, j STRING) > {code}
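The ambiguity can be made concrete with a pair of statements (a sketch of the issue, not output from any particular Spark version):

```sql
-- Explicit provider: unambiguously a data source table.
CREATE TABLE t1 (c1 INT, c2 STRING) USING parquet;

-- No USING clause: historically parsed as a Hive table in builds with Hive
-- support, but once the provider is optional it could equally be read as a
-- data source table using spark.sql.sources.default -- the ambiguity
-- Huaxin describes above.
CREATE TABLE t2 (c1 INT, c2 STRING)
TBLPROPERTIES ('prop1' = 'value1');
```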
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170945#comment-17170945 ] Imran Rashid commented on SPARK-32003: -- Fixed in 3.0.1 by https://github.com/apache/spark/pull/29193 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. 
So they used up 4 > stage attempts and the stage failed and thus the job.
[jira] [Updated] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-32003: - Fix Version/s: 3.0.1 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Assignee: Wing Yew Poon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. 
[jira] [Assigned] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
[ https://issues.apache.org/jira/browse/SPARK-32529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32529: Assignee: (was: Apache Spark) > Spark 3.0 History Server May Never Finish One Round Log Dir Scan > > > Key: SPARK-32529 > URL: https://issues.apache.org/jira/browse/SPARK-32529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Priority: Major > > If there are a large number (>100k) of applications in the log dir, listing the log > dir can take a few seconds. After getting the path list, some applications > might have finished already, and the filename will have changed from > "foo.inprogress" to "foo". > This leads to a problem when adding an entry to the listing: querying file > status such as `fileSizeForLastIndex` will throw a `FileNotFoundException` > if the application has finished. The exception aborts the > current scan loop; in a busy cluster, this can prevent the history server > from listing and loading any application logs. 
> > > {code:java} > 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.io.FileNotFoundException: File does not exist: > hdfs://xx/logs/spark/application_11.lz4.inprogress > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
[ https://issues.apache.org/jira/browse/SPARK-32529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32529: Assignee: Apache Spark > Spark 3.0 History Server May Never Finish One Round Log Dir Scan > > > Key: SPARK-32529 > URL: https://issues.apache.org/jira/browse/SPARK-32529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Assignee: Apache Spark >Priority: Major > > If there are a large number (>100k) of applications in the log dir, listing the log > dir can take a few seconds. After getting the path list, some applications > might have finished already, and the filename will have changed from > "foo.inprogress" to "foo". > This leads to a problem when adding an entry to the listing: querying file > status such as `fileSizeForLastIndex` will throw a `FileNotFoundException` > if the application has finished. The exception aborts the > current scan loop; in a busy cluster, this can prevent the history server > from listing and loading any application logs. 
> > > {code:java} > 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.io.FileNotFoundException: File does not exist: > hdfs://xx/logs/spark/application_11.lz4.inprogress > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
[ https://issues.apache.org/jira/browse/SPARK-32529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170926#comment-17170926 ] Apache Spark commented on SPARK-32529: -- User 'yanxiaole' has created a pull request for this issue: https://github.com/apache/spark/pull/29350 > Spark 3.0 History Server May Never Finish One Round Log Dir Scan > > > Key: SPARK-32529 > URL: https://issues.apache.org/jira/browse/SPARK-32529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yan Xiaole >Priority: Major > > If there are a large number (>100k) of applications in the log dir, listing the log > dir can take a few seconds. After getting the path list, some applications > might have finished already, and the filename will have changed from > "foo.inprogress" to "foo". > This leads to a problem when adding an entry to the listing: querying file > status such as `fileSizeForLastIndex` will throw a `FileNotFoundException` > if the application has finished. The exception aborts the > current scan loop; in a busy cluster, this can prevent the history server > from listing and loading any application logs. 
> > > {code:java} > 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event > log updates > java.io.FileNotFoundException: File does not exist: > hdfs://xx/logs/spark/application_11.lz4.inprogress > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) > at > org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:347) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > 
org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) > at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code} > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32525) The layout of monitoring.html is broken
[ https://issues.apache.org/jira/browse/SPARK-32525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-32525. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/29345 > The layout of monitoring.html is broken > --- > > Key: SPARK-32525 > URL: https://issues.apache.org/jira/browse/SPARK-32525 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > The layout of monitoring.html is broken because there are 2 tags not > closed in monitoring.md. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32529) Spark 3.0 History Server May Never Finish One Round Log Dir Scan
Yan Xiaole created SPARK-32529: -- Summary: Spark 3.0 History Server May Never Finish One Round Log Dir Scan Key: SPARK-32529 URL: https://issues.apache.org/jira/browse/SPARK-32529 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Yan Xiaole If there are a large number (>100k) of applications in the log dir, listing the dir can take a few seconds. By the time the path list is obtained, some applications may already have finished, and their filenames will have changed from "foo.inprogress" to "foo". This causes a problem when adding an entry to the listing: querying file status via `fileSizeForLastIndex` throws a `FileNotFoundException` if the application has finished. The exception aborts the current scan loop; in a busy cluster this can leave the history server unable to list and load any application logs. {code:java} 20/08/03 15:17:23 ERROR FsHistoryProvider: Exception in checking for event log updates java.io.FileNotFoundException: File does not exist: hdfs://xx/logs/spark/application_11.lz4.inprogress at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1527) at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1520) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1520) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status$lzycompute(EventLogFileReaders.scala:170) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.status(EventLogFileReaders.scala:170) at org.apache.spark.deploy.history.SingleFileEventLogFileReader.fileSizeForLastIndex(EventLogFileReaders.scala:174) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7(FsHistoryProvider.scala:523) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$checkForLogs$7$adapted(FsHistoryProvider.scala:466) at 
scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:256) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249) at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) at scala.collection.TraversableLike.filter(TraversableLike.scala:347) at scala.collection.TraversableLike.filter$(TraversableLike.scala:347) at scala.collection.AbstractTraversable.filter(Traversable.scala:108) at org.apache.spark.deploy.history.FsHistoryProvider.checkForLogs(FsHistoryProvider.scala:466) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$startPolling$3(FsHistoryProvider.scala:287) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1302) at org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$getRunner$1(FsHistoryProvider.scala:210) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
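The race described above (the directory listing goes stale while files are renamed from "foo.inprogress" to "foo", and a single FileNotFoundException aborts the whole scan round) comes down to guarding each per-entry stat call. A minimal sketch of that defensive pattern; `LogEntry` and the `fileSize` function are illustrative stand-ins, not FsHistoryProvider's actual API:

```scala
import java.io.FileNotFoundException

// Illustrative stand-in for the entries the history server iterates over.
final case class LogEntry(path: String)

// Between listing the log dir and statting each file, "foo.inprogress" can be
// renamed to "foo". A per-entry guard turns the resulting FileNotFoundException
// into a skipped entry instead of aborting the whole scan round.
def scanSafely(entries: Seq[LogEntry], fileSize: String => Long): Seq[(LogEntry, Long)] =
  entries.flatMap { e =>
    try Some(e -> fileSize(e.path))
    catch { case _: FileNotFoundException => None } // renamed/deleted: skip it
  }
```

With this shape, a file that vanishes between the listing and the stat is simply skipped, and the scan round still completes for every other application log.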
[jira] [Updated] (SPARK-32526) Let sql/catalyst module tests pass for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-32526: - Summary: Let sql/catalyst module tests pass for Scala 2.13 (was: Let sql/catalyst module compile for Scala 2.13) > Let sql/catalyst module tests pass for Scala 2.13 > - > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: catalyst-failed-cases > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq > on these to ensure they still work on 2.12. 
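The `.toSeq` fix the ticket points at can be shown in isolation. On Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, so a `mutable.ArrayBuffer` no longer satisfies a `Seq` return type and produces exactly the "type mismatch" errors quoted above. Appending `.toSeq` compiles on both 2.12 and 2.13 (the method name below is illustrative, not the Analyzer's actual code):

```scala
import scala.collection.mutable.ArrayBuffer

// On 2.13, Seq means immutable.Seq, so returning the buffer directly fails to
// compile with "found: ArrayBuffer[...] required: Seq[...]". On 2.12 the same
// code compiled because mutable collections conformed to scala.Seq.
def collectAttrs(pairs: Iterator[(Int, Int)]): Seq[(Int, Int)] = {
  val buf = ArrayBuffer.empty[(Int, Int)]
  pairs.foreach(buf += _)
  buf.toSeq // without .toSeq: OK on 2.12, compile error on 2.13
}
```

On 2.12 `.toSeq` is essentially free (the buffer already is a `Seq`), while on 2.13 it copies into an immutable sequence, which is why the ticket notes the change must still work on 2.12.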
[jira] [Resolved] (SPARK-32499) Use {} for structs and maps in show()
[ https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32499. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29308 [https://github.com/apache/spark/pull/29308] > Use {} for structs and maps in show() > - > > Key: SPARK-32499 > URL: https://issues.apache.org/jira/browse/SPARK-32499 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.1.0 > > > Currently, show() wraps arrays, maps and structs by []. Maps and structs > should be wrapped by {}: > - To be consistent with ToHiveResult > - To distinguish maps/structs from arrays -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
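The rendering change can be sketched with a toy formatter: arrays keep `[]`, while maps move to `{}` (structs, which this sketch does not model, get the same `{}` treatment in Spark). Illustrative only, not `Dataset.show()`'s actual code path:

```scala
// Toy value formatter mirroring the described behavior: sequences stay in [],
// maps are wrapped in {} so they can be told apart from arrays, matching
// ToHiveResult-style output. Single-entry maps keep the output deterministic.
def fmt(v: Any): String = v match {
  case s: Seq[_]    => s.map(fmt).mkString("[", ", ", "]")
  case m: Map[_, _] => m.map { case (k, x) => s"${fmt(k)} -> ${fmt(x)}" }.mkString("{", ", ", "}")
  case other        => String.valueOf(other)
}
```

For example, an array of one map would render as `[{k -> v}]` rather than the ambiguous `[[k -> v]]`.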
[jira] [Assigned] (SPARK-32499) Use {} for structs and maps in show()
[ https://issues.apache.org/jira/browse/SPARK-32499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32499: --- Assignee: Maxim Gekk > Use {} for structs and maps in show() > - > > Key: SPARK-32499 > URL: https://issues.apache.org/jira/browse/SPARK-32499 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, show() wraps arrays, maps and structs by []. Maps and structs > should be wrapped by {}: > - To be consistent with ToHiveResult > - To distinguish maps/structs from arrays -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170843#comment-17170843 ] Apache Spark commented on SPARK-32528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29349 > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32528: Assignee: Apache Spark (was: Wenchen Fan) > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32528: Assignee: Wenchen Fan (was: Apache Spark) > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32528) The analyze method should make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-32528: Summary: The analyze method should make sure the plan is analyzed (was: The analyze method make sure the plan is analyzed) > The analyze method should make sure the plan is analyzed > > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32528) The analyze method make sure the plan is analyzed
[ https://issues.apache.org/jira/browse/SPARK-32528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32528: --- Assignee: Wenchen Fan > The analyze method make sure the plan is analyzed > - > > Key: SPARK-32528 > URL: https://issues.apache.org/jira/browse/SPARK-32528 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32528) The analyze method make sure the plan is analyzed
Wenchen Fan created SPARK-32528: --- Summary: The analyze method make sure the plan is analyzed Key: SPARK-32528 URL: https://issues.apache.org/jira/browse/SPARK-32528 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation
[ https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170817#comment-17170817 ] Thomas Graves commented on SPARK-32037: --- allowlist and blocklist have been used by others. Seems we may only need blocklist. I'm hesitant with healthtracker as it could be used for other health checks but it does sound better. [https://github.com/golang/go/commit/608cdcaede1e7133dc994b5e8894272c2dce744b] [https://9to5google.com/2020/06/12/google-android-chrome-blacklist-blocklist-more-inclusive/] [https://bugzilla.mozilla.org/show_bug.cgi?id=1571734] DenyList: https://issues.apache.org/jira/browse/GEODE-5685 [https://github.com/nodejs/node/pull/33813] > Rename blacklisting feature to avoid language with racist connotation > - > > Key: SPARK-32037 > URL: https://issues.apache.org/jira/browse/SPARK-32037 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Minor > > As per [discussion on the Spark dev > list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E], > it will be beneficial to remove references to problematic language that can > alienate potential community members. One such reference is "blacklist". > While it seems to me that there is some valid debate as to whether this term > has racist origins, the cultural connotations are inescapable in today's > world. > I've created a separate task, SPARK-32036, to remove references outside of > this feature. Given the large surface area of this feature and the > public-facing UI / configs / etc., more care will need to be taken here. > I'd like to start by opening up debate on what the best replacement name > would be. Reject-/deny-/ignore-/block-list are common replacements for > "blacklist", but I'm not sure that any of them work well for this situation. 
[jira] [Resolved] (SPARK-23431) Expose the new executor memory metrics at the stage level
[ https://issues.apache.org/jira/browse/SPARK-23431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-23431. Resolution: Fixed The issue is resolved in https://github.com/apache/spark/pull/29020 > Expose the new executor memory metrics at the stage level > - > > Key: SPARK-23431 > URL: https://issues.apache.org/jira/browse/SPARK-23431 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Assignee: Terry Kim >Priority: Major > > Collect and show the new executor memory metrics for each stage, to provide > more information on how memory is used per stage. > Modify the AppStatusListener to track the peak values for JVM used memory, > execution memory, storage memory, and unified memory for each executor for > each stage. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
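The tracking described (peak values of each memory metric per executor per stage) reduces to a running max keyed by (stageId, executorId). A minimal sketch with invented names, not AppStatusListener's actual fields:

```scala
import scala.collection.mutable

// Running max of one metric per (stageId, executorId). AppStatusListener does
// this for several metrics at once (JVM used, execution, storage, unified
// memory); a single Long stands in for that metric set here.
final class PeakTracker {
  private val peaks = mutable.Map.empty[(Int, String), Long]

  // Record a sample; only the largest value seen per key is retained.
  def update(stageId: Int, executorId: String, value: Long): Unit = {
    val key = (stageId, executorId)
    peaks(key) = math.max(peaks.getOrElse(key, Long.MinValue), value)
  }

  def peak(stageId: Int, executorId: String): Option[Long] =
    peaks.get((stageId, executorId))
}
```

Keying on the stage as well as the executor is what distinguishes this from the existing executor-level metrics: the same executor can show different peaks in different stages.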
[jira] [Assigned] (SPARK-23431) Expose the new executor memory metrics at the stage level
[ https://issues.apache.org/jira/browse/SPARK-23431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-23431: -- Assignee: Terry Kim > Expose the new executor memory metrics at the stage level > - > > Key: SPARK-23431 > URL: https://issues.apache.org/jira/browse/SPARK-23431 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Assignee: Terry Kim >Priority: Major > > Collect and show the new executor memory metrics for each stage, to provide > more information on how memory is used per stage. > Modify the AppStatusListener to track the peak values for JVM used memory, > execution memory, storage memory, and unified memory for each executor for > each stage. > This is a subtask for SPARK-23206. Please refer to the design doc for that > ticket for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-32526: - Parent: SPARK-25075 Issue Type: Sub-task (was: Task) > Let sql/catalyst module compile for Scala 2.13 > -- > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: catalyst-failed-cases > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq > on these to ensure they still work on 2.12.
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170754#comment-17170754 ] Sean R. Owen commented on SPARK-32526: -- Yes, I know [~dongjoon] has also been working on tests and has up to core passing, so this would be next. > Let sql/catalyst module compile for Scala 2.13 > -- > > Key: SPARK-32526 > URL: https://issues.apache.org/jira/browse/SPARK-32526 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yang Jie >Priority: Minor > Attachments: catalyst-failed-cases > > > sql/catalyst module has following compile errors with scala-2.13 profile: > {code:java} > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? 
> [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] > [INFO] [Info] : > scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)] <: > Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, > org.apache.spark.sql.catalyst.expressions.Attribute)]? > [INFO] [Info] : false > [ERROR] [Error] > /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: > type mismatch; > found : > scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan] > {code} > Similar to https://issues.apache.org/jira/browse/SPARK-29292 , call .toSeq > on these to ensure they still work on 2.12. 
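The .toSeq fix referenced above can be illustrated with a small, self-contained sketch in plain Scala (no Spark classes; the method name and strings below are hypothetical stand-ins for the Analyzer/Optimizer signatures in the errors):

```scala
import scala.collection.mutable.ArrayBuffer

object ToSeqExample {
  // Stand-in for a method that requires a Seq, like the Analyzer and
  // Optimizer signatures in the compile errors above. On Scala 2.13,
  // scala.Seq is an alias for scala.collection.immutable.Seq, so a
  // mutable ArrayBuffer no longer conforms to it; on 2.12, scala.Seq
  // is collection.Seq and the buffer is accepted directly.
  def countPlans(plans: Seq[String]): Int = plans.length

  def main(args: Array[String]): Unit = {
    val buffer = ArrayBuffer("Project", "Filter", "Join")
    // countPlans(buffer)  // compiles on 2.12, fails to compile on 2.13
    // .toSeq compiles on both versions (on 2.13 it produces an
    // immutable copy; on 2.12 it simply returns the buffer as a Seq).
    println(countPlans(buffer.toSeq))
  }
}
```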
[jira] [Assigned] (SPARK-32524) SharedSparkSession should clean up InMemoryRelation.ser
[ https://issues.apache.org/jira/browse/SPARK-32524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32524: Assignee: Dongjoon Hyun > SharedSparkSession should clean up InMemoryRelation.ser > > > Key: SPARK-32524 > URL: https://issues.apache.org/jira/browse/SPARK-32524 > Project: Spark > Issue Type: Bug > Components: SQL, Tests > Affects Versions: 3.1.0 > Reporter: Dongjoon Hyun > Assignee: Dongjoon Hyun > Priority: Major
[jira] [Resolved] (SPARK-32524) SharedSparkSession should clean up InMemoryRelation.ser
[ https://issues.apache.org/jira/browse/SPARK-32524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32524. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29346 [https://github.com/apache/spark/pull/29346]
[jira] [Updated] (SPARK-32527) How to disable port 8080 in Spark?
[ https://issues.apache.org/jira/browse/SPARK-32527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fakrul Razi updated SPARK-32527: Description: I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. By default, when we start the master and the slaves, ports 8080 and 8081 are opened automatically by the Spark application. Due to security constraints, I would like to disable the Spark web UI for the master (8080) and all workers (8081). was: I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. I want to disable the Spark web UI for the master (8080) and all workers (8081). > How to disable port 8080 in Spark? > -- > > Key: SPARK-32527 > URL: https://issues.apache.org/jira/browse/SPARK-32527 > Project: Spark > Issue Type: Question > Components: Spark Core > Affects Versions: 2.3.3 > Reporter: Fakrul Razi > Priority: Critical > > I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. > By default, when we start the master and the slaves, ports 8080 and 8081 are opened > automatically by the Spark application. > Due to security constraints, I would like to disable the Spark web UI for the master > (8080) and all workers (8081).
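For context, a hedged sketch of the related configuration knobs (standalone mode; the port values below are arbitrary examples, not recommendations). Note that `spark.ui.enabled=false` disables only the per-application driver UI (port 4040), not the master/worker UIs, so fully blocking 8080/8081 may still require firewall rules:

```shell
# conf/spark-env.sh -- the standalone launch scripts read these
# environment variables to pick the master/worker web UI ports
# (defaults: 8080 for the master, 8081 for each worker).
export SPARK_MASTER_WEBUI_PORT=18080
export SPARK_WORKER_WEBUI_PORT=18081

# conf/spark-defaults.conf equivalent -- disables only the
# per-application driver UI on port 4040:
# spark.ui.enabled  false
```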
[jira] [Created] (SPARK-32527) How to disable port 8080 in Spark?
Fakrul Razi created SPARK-32527: --- Summary: How to disable port 8080 in Spark? Key: SPARK-32527 URL: https://issues.apache.org/jira/browse/SPARK-32527 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.3.3 Reporter: Fakrul Razi I am running Apache Spark 2.3.3 in standalone mode with client deploy mode. I want to disable the Spark web UI for the master (8080) and all workers (8081).
[jira] [Comment Edited] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170640#comment-17170640 ] Yang Jie edited comment on SPARK-32526 at 8/4/20, 8:01 AM: --- There are 97 test cases FAILED and 3 test cases ABORTED in the sql/catalyst module with the scala-2.13 profile; the list of failures is in the attachment and needs to be fixed later. In addition, the encode/decode test in RowEncoderSuite is ignored because it generates a large number of error characters.
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170640#comment-17170640 ] Yang Jie commented on SPARK-32526: -- There are 97 test cases FAILED and 3 test cases ABORTED in the sql/catalyst module with the scala-2.13 profile; the list of failures is in the attachment and needs to be fixed later. In addition, the encode/decode test in RowEncoderSuite is ignored because it generates a large number of error characters.
[jira] [Updated] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-32526: - Attachment: catalyst-failed-cases
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170626#comment-17170626 ] Yang Jie commented on SPARK-32526: -- [~srowen] Can this issue be a sub-task of https://issues.apache.org/jira/browse/SPARK-25075?
[jira] [Assigned] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32526: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170625#comment-17170625 ] Apache Spark commented on SPARK-32526: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/29348
[jira] [Assigned] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32526: Assignee: Apache Spark
[jira] [Commented] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170624#comment-17170624 ]

Apache Spark commented on SPARK-32526:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29348

> Let sql/catalyst module compile for Scala 2.13
> --
>
> Key: SPARK-32526
> URL: https://issues.apache.org/jira/browse/SPARK-32526
> Project: Spark
> Issue Type: Task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Yang Jie
> Priority: Minor
>
> The sql/catalyst module has the following compile errors with the scala-2.13 profile:
> {code:java}
> [ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
> [INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
> [INFO] [Info] : false
> [ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
>  required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
> [INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
> [INFO] [Info] : false
> [ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: type mismatch;
>  found   : scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
>  required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
> {code}
> Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq on these so that they still work with 2.12.
[jira] [Commented] (SPARK-32492) Fulfill missing column meta information for thrift server client tools
[ https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170619#comment-17170619 ]

Apache Spark commented on SPARK-32492:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29347

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: wx20200730-175...@2x.png
>
> Most fields of a column are missing, e.g. position, column-size
[jira] [Created] (SPARK-32526) Let sql/catalyst module compile for Scala 2.13
Yang Jie created SPARK-32526:

Summary: Let sql/catalyst module compile for Scala 2.13
Key: SPARK-32526
URL: https://issues.apache.org/jira/browse/SPARK-32526
Project: Spark
Issue Type: Task
Components: SQL
Affects Versions: 3.0.0
Reporter: Yang Jie

The sql/catalyst module has the following compile errors with the scala-2.13 profile:

{code:java}
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1284: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
 required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
[INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
[INFO] [Info] : false
[ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1289: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
 required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]
[INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, ?)]?
[INFO] [Info] : false
[ERROR] [Error] /Users/yangjie01/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:1297: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
 required: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]
[INFO] [Info] : scala.collection.mutable.ArrayBuffer[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)] <: Seq[(org.apache.spark.sql.catalyst.expressions.Attribute, org.apache.spark.sql.catalyst.expressions.Attribute)]?
[INFO] [Info] : false
[ERROR] [Error] /Users/baidu/SourceCode/git/spark-mine/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:952: type mismatch;
 found   : scala.collection.mutable.ArrayBuffer[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
 required: Seq[org.apache.spark.sql.catalyst.plans.logical.LogicalPlan]
{code}

Similar to https://issues.apache.org/jira/browse/SPARK-29292, call .toSeq on these so that they still work with 2.12.
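The type-mismatch errors and the `.toSeq` remedy can be sketched outside of Spark. The snippet below is illustrative only (the `takesSeq` helper is a hypothetical stand-in for the `Analyzer`/`Optimizer` call sites, not code from the actual patch):

{code:java}
import scala.collection.mutable.ArrayBuffer

object ToSeqSketch {
  // On Scala 2.13 the default scala.Seq is an alias for
  // scala.collection.immutable.Seq, so a mutable ArrayBuffer is no longer
  // a Seq and passing one where Seq is required is a type mismatch.
  // On 2.12, scala.Seq is scala.collection.Seq and the call compiles as-is.
  def takesSeq(pairs: Seq[(Int, Int)]): Int = pairs.size

  def main(args: Array[String]): Unit = {
    val buf = ArrayBuffer((1, 2), (3, 4))
    // takesSeq(buf)           // compiles on 2.12 only; type mismatch on 2.13
    println(takesSeq(buf.toSeq)) // compiles on both; prints 2
  }
}
{code}

On 2.12 the `.toSeq` call is essentially free (the buffer is already a `Seq`), while on 2.13 it copies into an immutable `Seq`, which is why the same source then cross-compiles under both profiles.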
[jira] [Commented] (SPARK-32492) Fulfill missing column meta information for thrift server client tools
[ https://issues.apache.org/jira/browse/SPARK-32492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170618#comment-17170618 ]

Apache Spark commented on SPARK-32492:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29347

> Fulfill missing column meta information for thrift server client tools
> --
>
> Key: SPARK-32492
> URL: https://issues.apache.org/jira/browse/SPARK-32492
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: wx20200730-175...@2x.png
>
> Most fields of a column are missing, e.g. position, column-size