[jira] [Created] (SPARK-32152) ./bin/spark-sql got error with reading hive metastore
jung bak created SPARK-32152: Summary: ./bin/spark-sql got error with reading hive metastore Key: SPARK-32152 URL: https://issues.apache.org/jira/browse/SPARK-32152 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: Spark 3.0.0 Hive 2.1.1 Reporter: jung bak 1. First of all, I built Spark 3.0.0 from source with the command below. {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package}} {quote} 2. I set ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.version 2.1.1 spark.sql.hive.metastore.jars maven {quote} 3. There is no problem running "${SPARK_HOME}/bin/spark-sql". 4. For the production environment, I copied all the jar files downloaded from Maven to ${SPARK_HOME}/lib/. 5. I changed ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.jars ${SPARK_HOME}/lib/ {quote} 6. Then I got the following error when running ./bin/spark-sql. {quote}Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException {quote} I found out that the HiveException class is in hive-exec-XXX.jar. Spark 3.0.0 is built with Hive 2.3.7 by default, and I could find "hive-exec-2.3.7-core.jar" after the build finished. I could also find hive-exec-2.1.1.jar downloaded from Maven when I used "spark.sql.hive.metastore.jars maven" in spark-defaults.conf. I suspect there is a conflict between Hive 2.1.1 and Hive 2.3.7 when I set spark.sql.hive.metastore.jars to ${SPARK_HOME}/lib/. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32152) ./bin/spark-sql got error with reading hive metastore
[ https://issues.apache.org/jira/browse/SPARK-32152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jung bak updated SPARK-32152: - Description: 1. First of all, I built Spark 3.0.0 from source with the command below. {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package}} {quote} 2. I set ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.version 2.1.1 spark.sql.hive.metastore.jars maven {quote} 3. There is no problem running "${SPARK_HOME}/bin/spark-sql". 4. For the production environment, I copied all the jar files downloaded from Maven to ${SPARK_HOME}/lib/. 5. I changed ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.jars ${SPARK_HOME}/lib/ {quote} 6. Then I got the following error when running ./bin/spark-sql. {quote}Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException {quote} I found out that the HiveException class is in hive-exec-XXX.jar. Spark 3.0.0 is built with Hive 2.3.7 by default, and I could find "hive-exec-2.3.7-core.jar" after the build finished. I could also find hive-exec-2.1.1.jar downloaded from Maven when I used "spark.sql.hive.metastore.jars maven" in spark-defaults.conf. I suspect there is a conflict between Hive 2.1.1 and Hive 2.3.7 when I set spark.sql.hive.metastore.jars to ${SPARK_HOME}/lib/. was: 1. Fist of all, I built Spark3.0.0 from source with below command. {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -Dskip Tests clean package}} {quote} 2. I set the ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.version 2.1.1 spark.sql.hive.metastore.jars {color:#FF}maven{color} {quote} 3. There is no problem to run "${SPARK_HOME}/bin/spark-sql" 4. For production environment, I copied all downloaded jar files from maven to ${SPARK_HOME}/lib/ 5. I changed ${SPARK_HOME}/conf/spark-defaluts.conf as below. {quote}spark.sql.hive.metastore.jars {color:#FF}${SPARK_HOME}/lib/{color} {quote} 6. Then I got error running command ./bin/spark-sql as below. {quote}Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException {quote} I found out that HiveException class is in the hive-exec-XXX.jar... Spark 3.0.0 was built with hive 2.3.7 by default, and I could find "hive-exec-2.3.7-core.jar" after I finished. and I could find hive-exec-2.1.1.jar downloaded from maven when I use "spark.sql.hive.metastore.jars maven" in the spark-defaults.conf. I thought that there are some conflict between hive 2.1.1 and hive 2.3.7 when I set the {color:#7a869a}spark.sql.hive.metastore.jars ${SPARK_HOME}/lib/.{color} > ./bin/spark-sql got error with reading hive metastore > - > > Key: SPARK-32152 > URL: https://issues.apache.org/jira/browse/SPARK-32152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0 > Hive 2.1.1 >Reporter: jung bak >Priority: Major > > 1. First of all, I built Spark 3.0.0 from source with the command below. > {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package}} > {quote} > 2. I set ${SPARK_HOME}/conf/spark-defaults.conf as below. > {quote}spark.sql.hive.metastore.version 2.1.1 > spark.sql.hive.metastore.jars maven > {quote} > 3. There is no problem running "${SPARK_HOME}/bin/spark-sql". > 4. For the production environment, I copied all the jar files downloaded from Maven > to ${SPARK_HOME}/lib/. > 5. I changed ${SPARK_HOME}/conf/spark-defaults.conf as below. > {quote}spark.sql.hive.metastore.jars > ${SPARK_HOME}/lib/ > {quote} > 6. Then I got the following error when running ./bin/spark-sql. > {quote}Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/ql/metadata/HiveException > {quote} > I found out that the HiveException class is in hive-exec-XXX.jar. > Spark 3.0.0 is built with Hive 2.3.7 by default, and I could find > "hive-exec-2.3.7-core.jar" after the build finished. I could also find > hive-exec-2.1.1.jar downloaded from Maven when I used > "spark.sql.hive.metastore.jars maven" in spark-defaults.conf. > > I suspect there is a conflict between Hive 2.1.1 and Hive 2.3.7 when > I set spark.sql.hive.metastore.jars to ${SPARK_HOME}/lib/. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
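A note on the configuration above: if I read the Spark 3.0 behaviour correctly, a value of spark.sql.hive.metastore.jars other than {{builtin}} or {{maven}} is treated as a standard JVM classpath, and a bare directory on a JVM classpath does not pick up jar files. So one plausible explanation for the NoClassDefFoundError, besides a Hive 2.1.1 / 2.3.7 conflict, is that the directory needs a /* glob (or an explicit list of jars) for hive-exec-2.1.1.jar to be visible. The sketch below only expresses that assumption in Scala; it is not a confirmed fix, and it assumes SPARK_HOME is set in the environment.

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumption being illustrated: the metastore-jars value is a JVM classpath, so the
// directory needs a trailing "/*" for the jars inside it to be picked up.
val spark = SparkSession.builder()
  .appName("hive-metastore-2.1.1")
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", sys.env("SPARK_HOME") + "/lib/*")
  .enableHiveSupport()
  .getOrCreate()

// Any metastore access will trigger loading of the isolated Hive client.
spark.sql("SHOW DATABASES").show()
{code}

The equivalent spark-defaults.conf entry would then end in /lib/* rather than /lib/.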
[jira] [Created] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
Kousuke Saruta created SPARK-32153: -- Summary: .m2 repository corruption can happen on Jenkins-worker4 Key: SPARK-32153 URL: https://issues.apache.org/jira/browse/SPARK-32153 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.1, 3.1.0 Reporter: Kousuke Saruta Assignee: Shane Knapp Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150048#comment-17150048 ] Kousuke Saruta commented on SPARK-32153: [~shaneknapp] Could you look into this? > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Issue Type: Bug (was: Improvement) > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012] |https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679]| These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025 https://github.com/apache/spark/pull/28971#issuecomment-652690849 https://github.com/apache/spark/pull/28942#issuecomment-652832012 |https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > https://github.com/apache/spark/pull/28971#issuecomment-652611025 > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28942#issuecomment-652832012] > |https://github.com/apache/spark/pull/28971#issuecomment-652611025 > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679]| > > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012 https://github.com/apache/spark/pull/28971#issuecomment-652611025 |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [ |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] [|https://github.com/apache/spark/pull/28942#issuecomment-652832012] These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012] |https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679]| These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28942#issuecomment-652832012 > https://github.com/apache/spark/pull/28971#issuecomment-652611025 > |https://github.com/apache/spark/pull/28942#issuecomment-652832012] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [ > |https://github.com/apache/spark/pull/28942#issuecomment-652832012] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > [|https://github.com/apache/spark/pull/28942#issuecomment-652832012] > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012 https://github.com/apache/spark/pull/28971#issuecomment-652611025 |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [ |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] [|https://github.com/apache/spark/pull/28942#issuecomment-652832012] These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025 https://github.com/apache/spark/pull/28971#issuecomment-652690849 https://github.com/apache/spark/pull/28942#issuecomment-652832012 |https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025 > https://github.com/apache/spark/pull/28971#issuecomment-652690849 > https://github.com/apache/spark/pull/28942#issuecomment-652832012 > |https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32154) Use ExpressionEncoder to serialize to catalyst type for the return type of ScalaUDF
wuyi created SPARK-32154: Summary: Use ExpressionEncoder to serialize to catalyst type for the return type of ScalaUDF Key: SPARK-32154 URL: https://issues.apache.org/jira/browse/SPARK-32154 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi Users can currently register a UDF with Instant/LocalDate as the return type even with spark.sql.datetime.java8API.enabled=false. However, the UDF can only actually be used with spark.sql.datetime.java8API.enabled=true. This can confuse users. The problem is that we use ExpressionEncoder to ser/deser types when registering the UDF, but use Catalyst converters, which are controlled by spark.sql.datetime.java8API.enabled, to ser/deser types when executing the UDF. If we also used ExpressionEncoder to ser/deser types, similar to what we do for input parameter types, then the UDF could support Instant/LocalDate, and even other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
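For concreteness, a minimal sketch of the confusing behaviour described above (the UDF name, app name, and local master are made up for illustration; the exact failure mode at execution time depends on the converters in use):

{code:scala}
import java.time.Instant
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("java8-api-udf")
  .config("spark.sql.datetime.java8API.enabled", "false")
  .getOrCreate()

// Registration goes through the ExpressionEncoder path and accepts java.time.Instant
// even though the Java 8 datetime API flag is off...
spark.udf.register("now_instant", () => Instant.now())

// ...but executing the UDF goes through the Catalyst converters, which only expect
// java.time.Instant when spark.sql.datetime.java8API.enabled=true.
spark.sql("SELECT now_instant()").show()
{code}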
[jira] [Updated] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to serialize to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-32154: - Summary: Use ExpressionEncoder for the return type of ScalaUDF to serialize to catalyst type (was: Use ExpressionEncoder to serialize to catalyst type for the return type of ScalaUDF) > Use ExpressionEncoder for the return type of ScalaUDF to serialize to > catalyst type > --- > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-32154: - Summary: Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type (was: Use ExpressionEncoder for the return type of ScalaUDF to serialize to catalyst type) > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32155) Provide options for offset-based semantics when using structured streaming from a file stream source
Christopher Highman created SPARK-32155: --- Summary: Provide options for offset-based semantics when using structured streaming from a file stream source Key: SPARK-32155 URL: https://issues.apache.org/jira/browse/SPARK-32155 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Christopher Highman Implement the following options while performing structured streaming from a file data source: {code:java} startingOffsetsByTimestamp endingOffsetsByTimestamp startingOffsets endingOffsets {code} These options currently exist when using structured streaming from a Kafka data source. *Please see comments from the below PR for details.* [#28841|[https://github.com/apache/spark/pull/28841]] *Example from usage with Kafka data source* [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32155) Provide options for offset-based semantics when using structured streaming from a file stream source
[ https://issues.apache.org/jira/browse/SPARK-32155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-32155: Description: Implement the following options while performing structured streaming from a file data source: {code:java} startingOffsetsByTimestamp endingOffsetsByTimestamp startingOffsets endingOffsets {code} These options currently exist when using structured streaming from a Kafka data source. *Please see comments from the below PR for details.* [https://github.com/apache/spark/pull/28841] *Example from usage with Kafka data source* [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] was: Implement the following options while performing structured streaming from a file data source: {code:java} startingOffsetsByTimestamp endingOffsetsByTimestamp startingOffsets endingOffsets {code} These options currently exist when using structured streaming from a Kafka data source. *Please see comments from the below PR for details.* [#28841|[https://github.com/apache/spark/pull/28841]] *Example from usage with Kafka data source* [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] > Provide options for offset-based semantics when using structured streaming > from a file stream source > > > Key: SPARK-32155 > URL: https://issues.apache.org/jira/browse/SPARK-32155 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > Implement the following options while performing structured streaming from a > file data source: > {code:java} > startingOffsetsByTimestamp > endingOffsetsByTimestamp > startingOffsets > endingOffsets > {code} > These options currently exist when using structured streaming from a Kafka > data source. > *Please see comments from the below PR for details.* > [https://github.com/apache/spark/pull/28841] > *Example from usage with Kafka data source* > > [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
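For reference, the Kafka batch-query usage that the proposal mirrors, adapted from the linked documentation (the bootstrap servers, topic, and offsets are placeholders); the idea above is to offer analogous options on the file source:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offset-options").getOrCreate()

// Batch query over a bounded offset range, as documented for the Kafka source.
// In the per-partition JSON, -2 means "earliest"; for a batch query, "latest" is
// only allowed as the ending offset.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2}}""")
  .option("endingOffsets", "latest")
  .load()
{code}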
[jira] [Assigned] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32154: Assignee: Apache Spark > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32154: Assignee: (was: Apache Spark) > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150132#comment-17150132 ] Apache Spark commented on SPARK-32154: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28979 > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150133#comment-17150133 ] Apache Spark commented on SPARK-32154: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28979 > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
JinxinTang created SPARK-32156: -- Summary: SPARK-31061 has two very similar tests could merge and somewhere could be improved Key: SPARK-32156 URL: https://issues.apache.org/jira/browse/SPARK-32156 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.0.0 Reporter: JinxinTang Fix For: 3.0.0 In {{org.apache.spark.sql.hive.HiveExternalCatalogSuite}} there are two very similar tests:
{code:scala}
test("SPARK-31061: alterTable should be able to change table provider") {
  val catalog = newBasicCatalog()
  val parquetTable = CatalogTable(
    identifier = TableIdentifier("parq_tbl", Some("db1")),
    tableType = CatalogTableType.MANAGED,
    storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))),
    schema = new StructType().add("col1", "int").add("col2", "string"),
    provider = Some("parquet"))
  catalog.createTable(parquetTable, ignoreIfExists = false)
  val rawTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(rawTable.provider === Some("parquet"))
  val fooTable = parquetTable.copy(provider = Some("foo"))  // <- `parquetTable` seems like it should be rawTable
  catalog.alterTable(fooTable)
  val alteredTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(alteredTable.provider === Some("foo"))
}

test("SPARK-31061: alterTable should be able to change table provider from hive") {
  val catalog = newBasicCatalog()
  val hiveTable = CatalogTable(
    identifier = TableIdentifier("parq_tbl", Some("db1")),
    tableType = CatalogTableType.MANAGED,
    storage = storageFormat,
    schema = new StructType().add("col1", "int").add("col2", "string"),
    provider = Some("hive"))
  catalog.createTable(hiveTable, ignoreIfExists = false)
  val rawTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(rawTable.provider === Some("hive"))
  val fooTable = rawTable.copy(provider = Some("foo"))
  catalog.alterTable(fooTable)
  val alteredTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(alteredTable.provider === Some("foo"))
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32156: Assignee: Apache Spark > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32156: Assignee: (was: Apache Spark) > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150150#comment-17150150 ] Apache Spark commented on SPARK-32156: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150153#comment-17150153 ] Apache Spark commented on SPARK-32156: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150156#comment-17150156 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150154#comment-17150154 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150155#comment-17150155 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150157#comment-17150157 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32121: Assignee: Cheng Pan > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32121. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28940 [https://github.com/apache/spark/pull/28940] > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
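Since the fix above is about separator handling, here is a tiny, purely illustrative sketch of the difference a test running on Windows has to account for; this is not the actual ExecutorDiskUtils.createNormalizedInternedPathname implementation, just the underlying platform behaviour:

{code:scala}
import java.io.File

// java.io.File joins and reports paths with the platform separator, so the same inputs
// yield "/foo/bar" on Linux but "C:\foo\bar" on Windows. A path normalizer, and any test
// asserting on its output, must not hard-code "/".
def normalizedPath(parent: String, child: String): String =
  new File(parent, child).getPath

// Linux:   normalizedPath("/foo", "bar")    == "/foo" + File.separator + "bar"
// Windows: normalizedPath("C:\\foo", "bar") == "C:\\foo" + File.separator + "bar"
{code}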
[jira] [Updated] (SPARK-32156) Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32156: - Summary: Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite (was: SPARK-31061 has two very similar tests could merge and somewhere could be improved) > Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite > > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150298#comment-17150298 ] Hyukjin Kwon commented on SPARK-25433: -- [~fhoering], I plan to redesign the PySpark documentation and I would like to put this in the documentation. Are you still active? I will cc on the related JIRAs if you are still interested in contributing the documentation. > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext but > zipping your local virtual env and then just changing PYSPARK_PYTHON env > variable should already work. > I also have seen this > [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each excecutor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. > Another problem with virtual env is that your local environment is not easily > shippable to another machine. In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package and when it is built you are sure it works. > This is in my opinion the most elegant way to ship python code (better than > virtual env and conda) > The problem why it doesn't work out of the box is that there can be only one > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > and runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31100) Detect namespace existence when setting namespace
[ https://issues.apache.org/jira/browse/SPARK-31100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31100. - Fix Version/s: 3.1.0 Assignee: Jackey Lee Resolution: Fixed > Detect namespace existence when setting namespace > - > > Key: SPARK-31100 > URL: https://issues.apache.org/jira/browse/SPARK-31100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Assignee: Jackey Lee >Priority: Major > Fix For: 3.1.0 > > > We should check if the namespace exists while calling "use namespace", and > throw NoSuchNamespaceException if the namespace does not exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
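A rough sketch of the requested behavior against the DataSourceV2 catalog API, for illustration only (the method below is hypothetical; the real change lives in the command that sets the current namespace and throws NoSuchNamespaceException rather than the generic error used here):
{code:scala}
import org.apache.spark.sql.connector.catalog.SupportsNamespaces

// Illustrative only: validate that the namespace exists before switching to it.
def useNamespace(catalog: SupportsNamespaces, ns: Array[String]): Unit = {
  if (!catalog.namespaceExists(ns)) {
    // The actual fix throws NoSuchNamespaceException here.
    throw new IllegalArgumentException(s"Namespace '${ns.mkString(".")}' not found")
  }
  // ... record ns as the session's current namespace
}
{code}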
[jira] [Created] (SPARK-32157) Integer overflow when constructing large query plan string
Tanel Kiis created SPARK-32157: -- Summary: Integer overflow when constructing large query plan string Key: SPARK-32157 URL: https://issues.apache.org/jira/browse/SPARK-32157 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Tanel Kiis When the length of the string representation of the query plan in org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat goes above Integer.MAX_VALUE, then the query can end with either of these two exception: "spark.sql.maxPlanStringLength" was set to 0: {noformat} java.lang.NegativeArraySizeException at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) at java.lang.StringBuilder.(StringBuilder.java:101) at org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.toString(StringUtils.scala:136) at org.apache.spark.sql.catalyst.util.StringUtils$PlanStringConcat.toString(StringUtils.scala:163) at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:208) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:95) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829) {noformat} "spark.sql.maxPlanStringLength" was at the default value: {noformat} java.lang.StringIndexOutOfBoundsException: String index out of range: -47 at java.lang.String.substring(String.java:1967) at org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.append(StringUtils.scala:123) at org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1(QueryExecution.scala:207) at org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1$adapted(QueryExecution.scala:207) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1(TreeNode.scala:663) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1$adapted(TreeNode.scala:662) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:662) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at 
org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$3(TreeNode.scala:693) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$3$adapted(TreeNode.scala:691) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeN
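The underlying failure mode is ordinary Int overflow in the accumulated length counter, which is easy to see in isolation (the numbers below are illustrative):
{code:scala}
// Int arithmetic wraps around silently, which is what turns the accumulated
// plan-string length negative once it passes Integer.MAX_VALUE and then feeds
// negative sizes/offsets into StringBuilder and String.substring, consistent
// with the two exceptions shown above.
val accumulated: Int = Int.MaxValue - 10   // length tracked so far
val nextFragment: Int = 100                // length of the next appended plan fragment
val newLength = accumulated + nextFragment // wraps to a negative value
println(newLength)                         // prints -2147483559
assert(newLength < 0)
{code}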
[jira] [Commented] (SPARK-30132) Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses
[ https://issues.apache.org/jira/browse/SPARK-30132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150387#comment-17150387 ] Dongjoon Hyun commented on SPARK-30132: --- Nice! Thanks! > Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses > > > Key: SPARK-30132 > URL: https://issues.apache.org/jira/browse/SPARK-30132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Minor > > A few classes in our test code extend Hadoop's LocalFileSystem. Scala 2.13 > returns a compile error here - not for the Spark code, but because the Hadoop > code (it says) illegally overrides appendFile() with slightly different > generic types in its return value. This code is valid Java, evidently, and > the code actually doesn't define any generic types, so, I even wonder if it's > a scalac bug. > So far I don't see a workaround for this. > This only affects the Hadoop 3.2 build, in that it comes up with respect to a > method new in Hadoop 3. (There is actually another instance of a similar > problem that affects Hadoop 2, but I can see a tiny hack workaround for it). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32157) Integer overflow when constructing large query plan string
[ https://issues.apache.org/jira/browse/SPARK-32157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis resolved SPARK-32157. Resolution: Duplicate > Integer overflow when constructing large query plan string > --- > > Key: SPARK-32157 > URL: https://issues.apache.org/jira/browse/SPARK-32157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Tanel Kiis >Priority: Major > > When the length of the string representation of the query plan in > org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat goes above > Integer.MAX_VALUE, then the query can end with either of these two exception: > "spark.sql.maxPlanStringLength" was set to 0: > {noformat} > java.lang.NegativeArraySizeException > at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) > at java.lang.StringBuilder.(StringBuilder.java:101) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.toString(StringUtils.scala:136) > at > org.apache.spark.sql.catalyst.util.StringUtils$PlanStringConcat.toString(StringUtils.scala:163) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:208) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:95) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829) > {noformat} > "spark.sql.maxPlanStringLength" was at the default value: > {noformat} > java.lang.StringIndexOutOfBoundsException: String index out of range: -47 > at java.lang.String.substring(String.java:1967) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.append(StringUtils.scala:123) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1(QueryExecution.scala:207) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1$adapted(QueryExecution.scala:207) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1(TreeNode.scala:663) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1$adapted(TreeNode.scala:662) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:662) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.Tree
[jira] [Closed] (SPARK-32157) Integer overflow when constructing large query plan string
[ https://issues.apache.org/jira/browse/SPARK-32157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis closed SPARK-32157. -- > Integer overflow when constructing large query plan string > --- > > Key: SPARK-32157 > URL: https://issues.apache.org/jira/browse/SPARK-32157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Tanel Kiis >Priority: Major > > When the length of the string representation of the query plan in > org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat goes above > Integer.MAX_VALUE, then the query can end with either of these two exception: > "spark.sql.maxPlanStringLength" was set to 0: > {noformat} > java.lang.NegativeArraySizeException > at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) > at java.lang.StringBuilder.(StringBuilder.java:101) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.toString(StringUtils.scala:136) > at > org.apache.spark.sql.catalyst.util.StringUtils$PlanStringConcat.toString(StringUtils.scala:163) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:208) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:95) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829) > {noformat} > "spark.sql.maxPlanStringLength" was at the default value: > {noformat} > java.lang.StringIndexOutOfBoundsException: String index out of range: -47 > at java.lang.String.substring(String.java:1967) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.append(StringUtils.scala:123) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1(QueryExecution.scala:207) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1$adapted(QueryExecution.scala:207) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1(TreeNode.scala:663) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1$adapted(TreeNode.scala:662) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:662) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeStri
[jira] [Resolved] (SPARK-32156) Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32156. --- Fix Version/s: (was: 3.0.0) 3.1.0 Resolution: Fixed Issue resolved by pull request 28980 [https://github.com/apache/spark/pull/28980] > Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite > > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.1.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32156) Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32156: - Assignee: JinxinTang > Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite > > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32158) Add JSONOptions to toJSON
German Schiavon Matteo created SPARK-32158: -- Summary: Add JSONOptions to toJSON Key: SPARK-32158 URL: https://issues.apache.org/jira/browse/SPARK-32158 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: German Schiavon Matteo Fix For: 3.0.1, 3.1.0 Currently, when calling `toJSON` on a DataFrame with null values, it doesn't print them. Basically the same idea as https://issues.apache.org/jira/browse/SPARK-23772. {code:scala} val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} After the PR: {code:scala} val result = df.toJSON(Map("ignoreNullFields" -> "false")).collect().mkString(",") val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" {code} [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
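Until `toJSON` accepts options, a similar result can be had by building the JSON column explicitly with `to_json`, which already takes options; a sketch reusing the `df` from the example above:
{code:scala}
import org.apache.spark.sql.functions.{col, struct, to_json}

// Workaround sketch: to_json already honors JSON options such as ignoreNullFields.
val json = df.select(
  to_json(struct(col("col1")), Map("ignoreNullFields" -> "false")).as("value"))
// json.collect() => {"col1":"1"}, {"col1":"2"}, {"col1":null}
{code}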
[jira] [Created] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
Erik Erlandson created SPARK-32159: -- Summary: New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization Key: SPARK-32159 URL: https://issues.apache.org/jira/browse/SPARK-32159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Erik Erlandson The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{ /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } }} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{ object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
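For reference, a minimal reproducer sketch of the array-input case described above (the aggregator, its semantics, and the column name are illustrative):
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, functions}
import org.apache.spark.sql.expressions.Aggregator

// Illustrative aggregator whose input type is an array rather than an atomic type.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(buf: Double, in: Array[Double]): Double = buf + in.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(buf: Double): Double = buf
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Registering the aggregator is fine; applying it to an Array[Double] column is
// what exercises the UnresolvedMapObjects path that fails on the executors.
val arraySum = functions.udaf(ArraySum)
// df.select(arraySum(functions.col("values")))  // hypothetical DataFrame with an Array[Double] column
{code}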
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Description: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{/** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } }}} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be was: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{ /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. 
*/ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } }} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{ object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Sp
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150490#comment-17150490 ] Erik Erlandson commented on SPARK-32159: cc [~cloud_fan] > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > {{/** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > @transient function: Expression => Expression, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression > with Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse { > throw new UnsupportedOperationException("not resolved") > } > }}} > The '@transient' is causing the function to be unpacked as 'null' over on the > executors, and it is causing a null-pointer exception here, when it tries to > do 'function(loopVar)' > {{object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = { > val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) > } > } > }} > I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Description: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( {color:#de350b}@transient function: Expression => Expression{color}, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)'* object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, customCollectionCls) } } *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be* was: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. 
*/ case class UnresolvedMapObjects( {color:#de350b}@transient function: Expression => Expression{color}, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)'* {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, customCollectionCls) } } }} *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be* > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark >
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Description: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( {color:#de350b}@transient function: Expression => Expression{color}, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)'* {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, customCollectionCls) } } }} *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be* was: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{/** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. 
*/ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } }}} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150523#comment-17150523 ] Sudharshann D. commented on SPARK-31579: Hey [~maxgekk]. friendly ping once again! > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150535#comment-17150535 ] Maxim Gekk commented on SPARK-31579: [~suddhuASF] Please, open a PR for master. > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150565#comment-17150565 ] Apache Spark commented on SPARK-32130: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28981 > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Assignee: Maxim Gekk >Priority: Critical > Fix For: 3.0.1, 3.1.0 > > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First > Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPer
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150566#comment-17150566 ] Apache Spark commented on SPARK-32130: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28981 > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Assignee: Maxim Gekk >Priority: Critical > Fix For: 3.0.1, 3.1.0 > > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First > Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPer
[jira] [Assigned] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32159: Assignee: Apache Spark > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Assignee: Apache Spark >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150590#comment-17150590 ] Apache Spark commented on SPARK-32159: -- User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/28983 > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32159: Assignee: (was: Apache Spark) > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150591#comment-17150591 ] Apache Spark commented on SPARK-32159: -- User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/28983 > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150593#comment-17150593 ] Dongjoon Hyun commented on SPARK-32159: --- Hi, [~eje]. Shall we set `Target Version` to `3.0.1`? > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150596#comment-17150596 ] Dongjoon Hyun commented on SPARK-31666: --- Apache Spark 3.0.0 is released last month and Apache Spark 3.1.0 is scheduled on December 2020. - https://spark.apache.org/versioning-policy.html > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150598#comment-17150598 ] Apache Spark commented on SPARK-32158: -- User 'Gschiavon' has created a pull request for this issue: https://github.com/apache/spark/pull/28984 > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32158: Assignee: Apache Spark > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Assignee: Apache Spark >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32158: Assignee: (was: Apache Spark) > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] German Schiavon Matteo updated SPARK-32158: --- Description: Actually when calling `toJSON` on a dataFrame with null values, it doesn't print them. Basically the same idea than https://issues.apache.org/jira/browse/SPARK-23772. {code:java} val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} After the PR: {code:java} val result = df.toJSON(Map("ignoreNullFields" -> "false")).collect().mkString(",") val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" {code} [~maropu] [~ueshin] [https://github.com/apache/spark/pull/28984/] was: Actually when calling `toJSON` on a dataFrame with null values, it doesn't print them. Basically the same idea than https://issues.apache.org/jira/browse/SPARK-23772. {code:java} val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} After the PR: {code:java} val result = df.toJSON(Map("ignoreNullFields" -> "false")).collect().mkString(",") val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" {code} [~maropu] > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] [~ueshin] > > [https://github.com/apache/spark/pull/28984/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
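For anyone trying the snippet above in a fresh application, a self-contained version of the current behaviour looks like this (local master assumed purely for illustration; the option-taking toJSON overload only exists once the linked PR lands):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-32158-repro").getOrCreate()
import spark.implicits._

// Current behaviour: null-valued fields are silently dropped from the generated JSON.
val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1")
println(df.toJSON.collect().mkString(","))   // {"col1":"1"},{"col1":"2"},{}
{code}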
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150599#comment-17150599 ] Dongjoon Hyun commented on SPARK-31666: --- FYI, Apache Spark 2.4.0 was released at November 2, 2018. It's already over 18 months. Apache Spark community wants to service the users a little longer with critical fixes like security and correctness issues. As a result, Apache Spark 2.4.7 will be released soon again. {quote}Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months {quote} > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150601#comment-17150601 ] Dongjoon Hyun commented on SPARK-31666: --- I linked SPARK-23529 since `hostPath` is added there. > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150601#comment-17150601 ] Dongjoon Hyun edited comment on SPARK-31666 at 7/2/20, 9:50 PM: I linked SPARK-23529 since `hostPath` is added there at 2.4.0. was (Author: dongjoon): I linked SPARK-23529 since `hostPath` is added there. > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25262) Support tmpfs for local dirs in k8s
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150605#comment-17150605 ] Apache Spark commented on SPARK-25262: -- User 'hopper-signifyd' has created a pull request for this issue: https://github.com/apache/spark/pull/28985 > Support tmpfs for local dirs in k8s > --- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Assignee: Rob Vesse >Priority: Major > Fix For: 3.0.0 > > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. > Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
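As a usage note, the knob this ticket introduced can be set like any other Spark property. The key below is quoted from memory of the running-on-kubernetes documentation, so treat it as an assumption and verify it against your Spark version:
{code:scala}
import org.apache.spark.SparkConf

// Assumed configuration key for tmpfs-backed local dirs on Kubernetes (verify before relying on it).
val conf = new SparkConf()
  .set("spark.kubernetes.local.dirs.tmpfs", "true")  // back the spark.local.dir emptyDir volumes with medium=Memory
{code}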
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150621#comment-17150621 ] Dongjoon Hyun commented on SPARK-31666: --- Hi, [~hopper-signifyd]. I found what is going on there. SPARK-23529 works correctly like the following. {code} # minikube ssh ls /data SPARK-31666.txt {code} {code} export HTTP2_DISABLE=true bin/spark-submit \ --master k8s://$K8S_MASTER \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image=spark/spark:v2.4.6 \ --conf spark.kubernetes.executor.volumes.hostPath.data.mount.path=/data \ --conf spark.kubernetes.executor.volumes.hostPath.data.options.path=/data \ local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar 1 {code} {code} # k exec po/spark-pi-1593729363998-exec-1 -- ls /data SPARK-31666.txt {code} Please see the error message `Invalid value: "/tmp1": must be unique.`. The error message occurs because `spark-local-dir-x` is already mounted as volume name by Spark. You should not use the same name. {code} 20/07/02 15:38:39 INFO LoggingPodStatusWatcherImpl: State changed, new state: pod name: spark-pi-1593729518015-driver namespace: default labels: spark-app-selector -> spark-74b65a9a61cc46fd8bfc5e03e4b28bb8, spark-role -> driver pod uid: d838532b-eaa9-4b11-8eba-655f66965580 creation time: 2020-07-02T22:38:39Z service account name: default volumes: spark-local-dir-1, spark-conf-volume, default-token-n5wwg node name: N/A start time: N/A container images: N/A phase: Pending status: [] {code} > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. 
Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150622#comment-17150622 ] Dongjoon Hyun commented on SPARK-31666: --- So, "Cannot map hostPath volumes to container" is a wrong claim. It's a fair warning from K8s to prevent duplicated volume names. I'll close this issue. > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
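To make the diagnosis above concrete, these are the two parts of the reported submission that end up requesting the same container mount path: Spark already mounts its own local-dir volume at spark.local.dir, so the user-supplied hostPath mount at the identical path is rejected as non-unique. Shown as SparkConf settings for readability; this simply mirrors the --conf flags in the report.
{code:scala}
import org.apache.spark.SparkConf

// The same /tmp1 path is requested twice: once implicitly through spark.local.dir
// (Spark's own spark-local-dir-1 volume) and once through the explicit hostPath volume.
val conf = new SparkConf()
  .set("spark.local.dir", "/tmp1")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path", "/tmp1")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path", "/tmp1")
// Giving the hostPath volume its own mount path (or dropping the explicit spark.local.dir)
// avoids the "must be unique" rejection, as in the working example shown earlier in the thread.
{code}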
[jira] [Resolved] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31666. --- Resolution: Not A Problem > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32160) Executors should not be able to create SparkContext.
Takuya Ueshin created SPARK-32160: - Summary: Executors should not be able to create SparkContext. Key: SPARK-32160 URL: https://issues.apache.org/jira/browse/SPARK-32160 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin Currently executors can create SparkContext, but shouldn't be able to create it. {code:scala} sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
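One plausible shape for the fix (a sketch only, not necessarily what the eventual patch does) is a driver-side assertion based on TaskContext, which is defined only while a task is running on an executor:
{code:scala}
import org.apache.spark.{SparkException, TaskContext}

// Sketch of a guard that a SparkContext constructor could call; the real fix may differ.
def assertOnDriver(): Unit = {
  if (TaskContext.get() != null) {
    // A non-null TaskContext means we are inside a running task, i.e. on an executor.
    throw new SparkException("SparkContext should only be created and accessed on the driver.")
  }
}
{code}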
[jira] [Assigned] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32160: Assignee: Apache Spark > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32160: Assignee: (was: Apache Spark) > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150633#comment-17150633 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/28986 > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150634#comment-17150634 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/28986 > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
Hyukjin Kwon created SPARK-32161: Summary: Hide JVM traceback for SparkUpgradeException Key: SPARK-32161 URL: https://issues.apache.org/jira/browse/SPARK-32161 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon We added {{SparkUpgradeException}}, for which the JVM traceback is pretty useless. See also https://github.com/apache/spark/pull/28736/files#r449184881 PySpark should also whitelist this exception and hide the JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32162) Improve Pandas Grouped Map with Window test output
Bryan Cutler created SPARK-32162: Summary: Improve Pandas Grouped Map with Window test output Key: SPARK-32162 URL: https://issues.apache.org/jira/browse/SPARK-32162 Project: Spark Issue Type: Improvement Components: PySpark, Tests Affects Versions: 3.0.0 Reporter: Bryan Cutler The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is not helpful, only gives {code} == FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) -- Traceback (most recent call last): File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key self.assertTrue(all([r[0] for r in result])) AssertionError: False is not true -- Ran 21 tests in 141.194s FAILED (failures=1) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32162) Improve Pandas Grouped Map with Window test output
[ https://issues.apache.org/jira/browse/SPARK-32162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32162: Assignee: (was: Apache Spark) > Improve Pandas Grouped Map with Window test output > -- > > Key: SPARK-32162 > URL: https://issues.apache.org/jira/browse/SPARK-32162 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Minor > > The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is > not helpful, only gives > {code} > == > FAIL: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 588, in test_grouped_over_window_with_key > self.assertTrue(all([r[0] for r in result])) > AssertionError: False is not true > -- > Ran 21 tests in 141.194s > FAILED (failures=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32162) Improve Pandas Grouped Map with Window test output
[ https://issues.apache.org/jira/browse/SPARK-32162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32162: Assignee: Apache Spark > Improve Pandas Grouped Map with Window test output > -- > > Key: SPARK-32162 > URL: https://issues.apache.org/jira/browse/SPARK-32162 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > > The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is > not helpful, only gives > {code} > == > FAIL: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 588, in test_grouped_over_window_with_key > self.assertTrue(all([r[0] for r in result])) > AssertionError: False is not true > -- > Ran 21 tests in 141.194s > FAILED (failures=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32162) Improve Pandas Grouped Map with Window test output
[ https://issues.apache.org/jira/browse/SPARK-32162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150673#comment-17150673 ] Apache Spark commented on SPARK-32162: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28987 > Improve Pandas Grouped Map with Window test output > -- > > Key: SPARK-32162 > URL: https://issues.apache.org/jira/browse/SPARK-32162 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Minor > > The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is > not helpful, only gives > {code} > == > FAIL: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 588, in test_grouped_over_window_with_key > self.assertTrue(all([r[0] for r in result])) > AssertionError: False is not true > -- > Ran 21 tests in 141.194s > FAILED (failures=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
L. C. Hsieh created SPARK-32163: --- Summary: Nested pruning should still work for nested column extractors of attributes with cosmetic variations Key: SPARK-32163 URL: https://issues.apache.org/jira/browse/SPARK-32163 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh If the expressions extracting nested fields have cosmetic variations such as qualifier differences, nested column pruning currently does not work well. For example, two attributes that are semantically the same may be referenced in a query, but their nested column extractors are treated differently when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150678#comment-17150678 ] Apache Spark commented on SPARK-32163: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28988 > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32163: Assignee: L. C. Hsieh (was: Apache Spark) > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32163: Assignee: Apache Spark (was: L. C. Hsieh) > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32163: Issue Type: Bug (was: Improvement) > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
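For readers unfamiliar with the feature, here is a PySpark sketch (paths, schema, and column names are made up) of one way to observe whether nested column pruning applied, by inspecting the ReadSchema reported in the physical plan. It is not claimed to reproduce the exact miss described in this ticket; it only illustrates referencing the same nested field through differently qualified aliases.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("nested-pruning-check").getOrCreate()

# Nested schema where only meta.id should be needed downstream.
df = spark.createDataFrame(
    [(1, ("a", "big payload")), (2, ("b", "another payload"))],
    "key INT, meta STRUCT<id: STRING, payload: STRING>")
df.write.mode("overwrite").parquet("/tmp/nested_pruning_demo")  # hypothetical path

src = spark.read.parquet("/tmp/nested_pruning_demo")

# The same relation referenced under two different qualifiers ("l" and "r")
# is the kind of cosmetic variation the report talks about.
joined = (src.alias("l")
          .join(src.alias("r"), F.col("l.key") == F.col("r.key"))
          .select(F.col("l.meta").getField("id").alias("id")))

# If pruning applied, ReadSchema in the scan nodes should mention only meta.id
# rather than the full meta struct including its payload field.
joined.explain()
{code}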
[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150688#comment-17150688 ] Apache Spark commented on SPARK-27194: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/28989 > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
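For context, a minimal PySpark sketch of the write pattern involved (dynamic partition overwrite into a partitioned table, which routes task output through the .spark-staging directory mentioned above). The table and column names are illustrative, not taken from the report.
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("dynamic-overwrite-demo")
         # The mode the reporter says is in play when the failure shows up.
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

# Hypothetical partitioned target; with "dynamic" mode only the partitions
# present in the incoming data are replaced, and each task writes its file
# under a hidden .spark-staging-<id> directory before commit.
spark.sql("""
  CREATE TABLE IF NOT EXISTS supply_tmp (value INT, dt STRING)
  USING PARQUET
  PARTITIONED BY (dt)
""")

df = spark.createDataFrame([(1, "2019-02-17"), (2, "2019-02-18")], "value INT, dt STRING")

# insertInto resolves columns by position, so keep the (value, dt) ordering.
df.write.mode("overwrite").insertInto("supply_tmp")
{code}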
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150689#comment-17150689 ] Apache Spark commented on SPARK-29302: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/28989 > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is deterministic. > So, if speculation is enabled, could a task conflict with its speculative > attempt? > Could the two tasks concurrently write to the same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
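As background for the question being asked, the combination of settings involved looks roughly like this; the values are illustrative and not a recommendation from this thread.
{code:python}
from pyspark import SparkConf

# Speculation launches a duplicate of a slow task; the question above is whether
# that duplicate can race the original attempt on the same deterministic
# staging file name when partitionOverwriteMode is "dynamic".
conf = (SparkConf()
        .set("spark.speculation", "true")
        .set("spark.speculation.quantile", "0.75")    # fraction of tasks finished before speculating
        .set("spark.speculation.multiplier", "1.5")   # how much slower than the median a task must be
        .set("spark.sql.sources.partitionOverwriteMode", "dynamic"))
{code}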
[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150690#comment-17150690 ] Apache Spark commented on SPARK-27194: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/28989 > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
zhengruifeng created SPARK-32164: Summary: GeneralizedLinearRegressionSummary optimization Key: SPARK-32164 URL: https://issues.apache.org/jira/browse/SPARK-32164 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng Compute several statistics in a single pass. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
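The summary above is terse; as a generic illustration of the idea (computing several summary statistics in one scan of the predictions rather than one job per statistic), here is a PySpark sketch with made-up columns. The actual change targets the Scala GeneralizedLinearRegressionSummary internals and is not reproduced here.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("single-pass-stats").getOrCreate()

# Stand-in for a fitted model's prediction output: label, prediction, weight.
preds = spark.createDataFrame(
    [(1.0, 0.8, 1.0), (0.0, 0.3, 2.0), (1.0, 0.6, 1.0)],
    "label DOUBLE, prediction DOUBLE, weight DOUBLE")

resid = F.col("label") - F.col("prediction")

# One wide agg() computes every quantity in a single pass over `preds`,
# instead of triggering a separate Spark job per statistic.
stats = preds.agg(
    F.count(F.lit(1)).alias("n"),
    F.sum("weight").alias("weight_sum"),
    F.sum(F.col("weight") * F.col("label")).alias("weighted_label_sum"),
    F.sum(F.col("weight") * resid * resid).alias("weighted_squared_error"),
).first()

print(stats["weighted_squared_error"] / stats["weight_sum"])
{code}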
[jira] [Assigned] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32164: Assignee: (was: Apache Spark) > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150692#comment-17150692 ] Apache Spark commented on SPARK-32164: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/28990 > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32164: Assignee: Apache Spark > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
Xianjin YE created SPARK-32165: -- Summary: SessionState leaks SparkListener with multiple SparkSession Key: SPARK-32165 URL: https://issues.apache.org/jira/browse/SPARK-32165 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xianjin YE Copied from [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
{code:java}
test("SPARK-31354: SparkContext only register one SparkSession ApplicationEnd listener") {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("test-app-SPARK-31354-1")
  val context = new SparkContext(conf)

  SparkSession
    .builder()
    .sparkContext(context)
    .master("local")
    .getOrCreate()
    .sessionState // this touches the sessionState
  val postFirstCreation = context.listenerBus.listeners.size()
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()

  SparkSession
    .builder()
    .sparkContext(context)
    .master("local")
    .getOrCreate()
    .sessionState // this touches the sessionState
  val postSecondCreation = context.listenerBus.listeners.size()
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()

  assert(postFirstCreation == postSecondCreation)
}
{code}
The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32166) Metastore problem on Spark3.0 with Hive3.0
hzk created SPARK-32166: --- Summary: Metastore problem on Spark3.0 with Hive3.0 Key: SPARK-32166 URL: https://issues.apache.org/jira/browse/SPARK-32166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: hzk When I use spark-sql to create a table, the problem appears.
{code:java}
create table bigbig as
select
  b.user_id,
  b.name,
  b.age,
  c.address,
  c.city,
  a.position,
  a.object,
  a.problem,
  a.complaint_time
from (
  select user_id, position, object, problem, complaint_time
  from HIVE_COMBINE_7efde4e2dcb34c218b3fb08872e698d5
) as a
left join HIVE_ODS_17_TEST_DEMO_ODS_USERS_INFO_20200608141945 as b on b.user_id = a.user_id
left join HIVE_ODS_17_TEST_ADDRESS_CITY_20200608141942 as c on c.address_id = b.address_id;
{code}
It opened a connection to the Hive metastore. My Hive version is 3.1.0.
{code:java}
org.apache.thrift.TApplicationException: Required field 'filesAdded' is unset! Struct:InsertEventRequestData(filesAdded:null)
org.apache.thrift.TApplicationException: Required field 'filesAdded' is unset! Struct:InsertEventRequestData(filesAdded:null)
 at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
 at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947)
 at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
 at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
 at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
 at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
 at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
 at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
 at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
 at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
 at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
 at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
 at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
 at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
 at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
 at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
 at org.apache.spark.sql.Dataset.(Dataset.scala:190)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
 at org.apache.spark.sql.S
[jira] [Resolved] (SPARK-25594) OOM in long running applications even with UI disabled
[ https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-25594. - Resolution: Won't Fix > OOM in long running applications even with UI disabled > -- > > Key: SPARK-25594 > URL: https://issues.apache.org/jira/browse/SPARK-25594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > > Typically, for long-running applications with a large number of tasks, it is > common to disable the UI to minimize overhead at the driver. > Earlier, with the Spark UI disabled, only stage/job information was kept as part > of JobProgressListener. > As part of the history server scalability fixes, particularly SPARK-20643, > task information continues to be maintained in memory in spite of the UI being disabled. > In our long-running tests against the Spark Thrift Server, this eventually > results in an OOM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25594) OOM in long running applications even with UI disabled
[ https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150729#comment-17150729 ] Mridul Muralidharan commented on SPARK-25594: - Given the regression in functionality if this is merged, closing the bug. See comment: https://github.com/apache/spark/pull/22609#issuecomment-426405757 > OOM in long running applications even with UI disabled > -- > > Key: SPARK-25594 > URL: https://issues.apache.org/jira/browse/SPARK-25594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > > Typically, for long-running applications with a large number of tasks, it is > common to disable the UI to minimize overhead at the driver. > Earlier, with the Spark UI disabled, only stage/job information was kept as part > of JobProgressListener. > As part of the history server scalability fixes, particularly SPARK-20643, > task information continues to be maintained in memory in spite of the UI being disabled. > In our long-running tests against the Spark Thrift Server, this eventually > results in an OOM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
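The ticket was closed as Won't Fix, so the sketch below is only context on the knobs involved: the UI-related retention settings a long-running driver (such as the Thrift Server scenario above) can tune. The defaults in the comments are from the Spark configuration docs; per this ticket, disabling the UI alone did not stop task data from being retained on the affected versions.
{code:python}
from pyspark import SparkConf

# Illustrative values for a long-running driver; these bound how much job,
# stage and task state the status store keeps, independent of whether the
# UI is actually served.
conf = (SparkConf()
        .set("spark.ui.enabled", "false")
        .set("spark.ui.retainedJobs", "200")      # default 1000
        .set("spark.ui.retainedStages", "200")    # default 1000
        .set("spark.ui.retainedTasks", "10000"))  # default 100000
{code}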