[jira] [Created] (SPARK-32152) ./bin/spark-sql got error with reading hive metastore
jung bak created SPARK-32152: Summary: ./bin/spark-sql got error with reading hive metastore Key: SPARK-32152 URL: https://issues.apache.org/jira/browse/SPARK-32152 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: Spark 3.0.0 Hive 2.1.1 Reporter: jung bak 1. First of all, I built Spark 3.0.0 from source with the command below. {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package}} {quote} 2. I set ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.version 2.1.1 spark.sql.hive.metastore.jars maven {quote} 3. There is no problem running "${SPARK_HOME}/bin/spark-sql". 4. For the production environment, I copied all the jar files downloaded from Maven to ${SPARK_HOME}/lib/. 5. I changed ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.jars ${SPARK_HOME}/lib/ {quote} 6. Then I got the following error when running ./bin/spark-sql. {quote}Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException {quote} I found out that the HiveException class is in hive-exec-XXX.jar. Spark 3.0.0 is built with Hive 2.3.7 by default, and I could find "hive-exec-2.3.7-core.jar" after the build finished. I could also find hive-exec-2.1.1.jar downloaded from Maven when I used "spark.sql.hive.metastore.jars maven" in spark-defaults.conf. I suspect there is a conflict between Hive 2.1.1 and Hive 2.3.7 when I set spark.sql.hive.metastore.jars to ${SPARK_HOME}/lib/. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32152) ./bin/spark-sql got error with reading hive metastore
[ https://issues.apache.org/jira/browse/SPARK-32152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jung bak updated SPARK-32152: - Description: 1. First of all, I built Spark 3.0.0 from source with the command below. {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package}} {quote} 2. I set ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.version 2.1.1 spark.sql.hive.metastore.jars maven {quote} 3. There is no problem running "${SPARK_HOME}/bin/spark-sql". 4. For the production environment, I copied all the jar files downloaded from Maven to ${SPARK_HOME}/lib/. 5. I changed ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.jars ${SPARK_HOME}/lib/ {quote} 6. Then I got the following error when running ./bin/spark-sql. {quote}Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException {quote} I found out that the HiveException class is in hive-exec-XXX.jar. Spark 3.0.0 is built with Hive 2.3.7 by default, and I could find "hive-exec-2.3.7-core.jar" after the build finished. I could also find hive-exec-2.1.1.jar downloaded from Maven when I used "spark.sql.hive.metastore.jars maven" in spark-defaults.conf. I suspect there is a conflict between Hive 2.1.1 and Hive 2.3.7 when I set spark.sql.hive.metastore.jars to ${SPARK_HOME}/lib/. was: 1. Fist of all, I built Spark3.0.0 from source with below command. {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -Dskip Tests clean package}} {quote} 2. I set the ${SPARK_HOME}/conf/spark-defaults.conf as below. {quote}spark.sql.hive.metastore.version 2.1.1 spark.sql.hive.metastore.jars {color:#FF}maven{color} {quote} 3. There is no problem to run "${SPARK_HOME}/bin/spark-sql" 4. For production environment, I copied all downloaded jar files from maven to ${SPARK_HOME}/lib/ 5. I changed ${SPARK_HOME}/conf/spark-defaluts.conf as below. {quote}spark.sql.hive.metastore.jars {color:#FF}${SPARK_HOME}/lib/{color} {quote} 6. Then I got error running command ./bin/spark-sql as below. {quote}Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException {quote} I found out that HiveException class is in the hive-exec-XXX.jar... Spark 3.0.0 was built with hive 2.3.7 by default, and I could find "hive-exec-2.3.7-core.jar" after I finished. and I could find hive-exec-2.1.1.jar downloaded from maven when I use "spark.sql.hive.metastore.jars maven" in the spark-defaults.conf. I thought that there are some conflict between hive 2.1.1 and hive 2.3.7 when I set the {color:#7a869a}spark.sql.hive.metastore.jars ${SPARK_HOME}/lib/.{color} > ./bin/spark-sql got error with reading hive metastore > - > > Key: SPARK-32152 > URL: https://issues.apache.org/jira/browse/SPARK-32152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Spark 3.0.0 > Hive 2.1.1 >Reporter: jung bak >Priority: Major > > 1. First of all, I built Spark 3.0.0 from source with the command below. > {quote}{{./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package}} > {quote} > 2. I set ${SPARK_HOME}/conf/spark-defaults.conf as below. > {quote}spark.sql.hive.metastore.version 2.1.1 > spark.sql.hive.metastore.jars maven > {quote} > 3. There is no problem running "${SPARK_HOME}/bin/spark-sql". > 4. For the production environment, I copied all the jar files downloaded from Maven > to ${SPARK_HOME}/lib/. > 5. I changed ${SPARK_HOME}/conf/spark-defaults.conf as below. > {quote}spark.sql.hive.metastore.jars > ${SPARK_HOME}/lib/ > {quote} > 6. Then I got the following error when running ./bin/spark-sql. > {quote}Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/ql/metadata/HiveException > {quote} > I found out that the HiveException class is in hive-exec-XXX.jar. > Spark 3.0.0 is built with Hive 2.3.7 by default, and I could find > "hive-exec-2.3.7-core.jar" after the build finished. I could also find > hive-exec-2.1.1.jar downloaded from Maven when I used > "spark.sql.hive.metastore.jars maven" in spark-defaults.conf. > > I suspect there is a conflict between Hive 2.1.1 and Hive 2.3.7 when > I set spark.sql.hive.metastore.jars to ${SPARK_HOME}/lib/. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
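A note on the configuration above: if I read the Spark 3.0 behaviour correctly, a value of spark.sql.hive.metastore.jars other than {{builtin}} or {{maven}} is treated as a standard JVM classpath, and a bare directory on a JVM classpath does not pick up jar files. So one plausible explanation for the NoClassDefFoundError, besides a Hive 2.1.1 / 2.3.7 conflict, is that the directory needs a /* glob (or an explicit list of jars) for hive-exec-2.1.1.jar to be visible. The sketch below only expresses that assumption in Scala; it is not a confirmed fix, and it assumes SPARK_HOME is set in the environment.

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumption being illustrated: the metastore-jars value is a JVM classpath, so the
// directory needs a trailing "/*" for the jars inside it to be picked up.
val spark = SparkSession.builder()
  .appName("hive-metastore-2.1.1")
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", sys.env("SPARK_HOME") + "/lib/*")
  .enableHiveSupport()
  .getOrCreate()

// Any metastore access will trigger loading of the isolated Hive client.
spark.sql("SHOW DATABASES").show()
{code}

The equivalent spark-defaults.conf entry would then end in /lib/* rather than /lib/.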
[jira] [Created] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
Kousuke Saruta created SPARK-32153: -- Summary: .m2 repository corruption can happen on Jenkins-worker4 Key: SPARK-32153 URL: https://issues.apache.org/jira/browse/SPARK-32153 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.0.1, 3.1.0 Reporter: Kousuke Saruta Assignee: Shane Knapp Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150048#comment-17150048 ] Kousuke Saruta commented on SPARK-32153: [~shaneknapp] Could you look into this? > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Issue Type: Bug (was: Improvement) > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012] |https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679]| These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025 https://github.com/apache/spark/pull/28971#issuecomment-652690849 https://github.com/apache/spark/pull/28942#issuecomment-652832012 |https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > https://github.com/apache/spark/pull/28971#issuecomment-652611025 > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28942#issuecomment-652832012] > |https://github.com/apache/spark/pull/28971#issuecomment-652611025 > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679]| > > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012 https://github.com/apache/spark/pull/28971#issuecomment-652611025 |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [ |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] [|https://github.com/apache/spark/pull/28942#issuecomment-652832012] These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012] |https://github.com/apache/spark/pull/28971#issuecomment-652611025 [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679]| These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28942#issuecomment-652832012 > https://github.com/apache/spark/pull/28971#issuecomment-652611025 > |https://github.com/apache/spark/pull/28942#issuecomment-652832012] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [ > |https://github.com/apache/spark/pull/28942#issuecomment-652832012] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > [|https://github.com/apache/spark/pull/28942#issuecomment-652832012] > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28971#issuecomment-652690849] [https://github.com/apache/spark/pull/28942#issuecomment-652832012 https://github.com/apache/spark/pull/28971#issuecomment-652611025 |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [ |https://github.com/apache/spark/pull/28942#issuecomment-652832012] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] [|https://github.com/apache/spark/pull/28942#issuecomment-652832012] These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28971#issuecomment-652690849] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > These can be related to .m2 corruption. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32153) .m2 repository corruption can happen on Jenkins-worker4
[ https://issues.apache.org/jira/browse/SPARK-32153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32153: --- Description: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652570066] [https://github.com/apache/spark/pull/28971#issuecomment-652611025 https://github.com/apache/spark/pull/28971#issuecomment-652690849 https://github.com/apache/spark/pull/28942#issuecomment-652832012 |https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] [https://github.com/apache/spark/pull/28942#issuecomment-652835679] These can be related to .m2 corruption. was: Build task on Jenkins-worker4 often fails with dependency problem. [https://github.com/apache/spark/pull/28971#issuecomment-652611025] [https://github.com/apache/spark/pull/28942#issuecomment-652842960] These can be related to .m2 corruption. > .m2 repository corruption can happen on Jenkins-worker4 > --- > > Key: SPARK-32153 > URL: https://issues.apache.org/jira/browse/SPARK-32153 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Shane Knapp >Priority: Critical > > Build task on Jenkins-worker4 often fails with dependency problem. > [https://github.com/apache/spark/pull/28971#issuecomment-652570066] > [https://github.com/apache/spark/pull/28971#issuecomment-652611025 > https://github.com/apache/spark/pull/28971#issuecomment-652690849 > https://github.com/apache/spark/pull/28942#issuecomment-652832012 > |https://github.com/apache/spark/pull/28971#issuecomment-652611025] > [https://github.com/apache/spark/pull/28942#issuecomment-652842960] > [https://github.com/apache/spark/pull/28942#issuecomment-652835679] > > These can be related to .m2 corruption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32154) Use ExpressionEncoder to serialize to catalyst type for the return type of ScalaUDF
wuyi created SPARK-32154: Summary: Use ExpressionEncoder to serialize to catalyst type for the return type of ScalaUDF Key: SPARK-32154 URL: https://issues.apache.org/jira/browse/SPARK-32154 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi Users can currently register a UDF with Instant/LocalDate as the return type even with spark.sql.datetime.java8API.enabled=false. However, the UDF can only actually be used with spark.sql.datetime.java8API.enabled=true. This can confuse users. The problem is that we use ExpressionEncoder to ser/deser types when registering the UDF, but use Catalyst converters, which are controlled by spark.sql.datetime.java8API.enabled, to ser/deser types when executing the UDF. If we also used ExpressionEncoder to ser/deser types, similar to what we do for input parameter types, then the UDF could support Instant/LocalDate, and even other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
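For concreteness, a minimal sketch of the confusing behaviour described above (the UDF name, app name, and local master are made up for illustration; the exact failure mode at execution time depends on the converters in use):

{code:scala}
import java.time.Instant
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("java8-api-udf")
  .config("spark.sql.datetime.java8API.enabled", "false")
  .getOrCreate()

// Registration goes through the ExpressionEncoder path and accepts java.time.Instant
// even though the Java 8 datetime API flag is off...
spark.udf.register("now_instant", () => Instant.now())

// ...but executing the UDF goes through the Catalyst converters, which only expect
// java.time.Instant when spark.sql.datetime.java8API.enabled=true.
spark.sql("SELECT now_instant()").show()
{code}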
[jira] [Updated] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to serialize to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-32154: - Summary: Use ExpressionEncoder for the return type of ScalaUDF to serialize to catalyst type (was: Use ExpressionEncoder to serialize to catalyst type for the return type of ScalaUDF) > Use ExpressionEncoder for the return type of ScalaUDF to serialize to > catalyst type > --- > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-32154: - Summary: Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type (was: Use ExpressionEncoder for the return type of ScalaUDF to serialize to catalyst type) > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32155) Provide options for offset-based semantics when using structured streaming from a file stream source
Christopher Highman created SPARK-32155: --- Summary: Provide options for offset-based semantics when using structured streaming from a file stream source Key: SPARK-32155 URL: https://issues.apache.org/jira/browse/SPARK-32155 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Christopher Highman Implement the following options while performing structured streaming from a file data source: {code:java} startingOffsetsByTimestamp endingOffsetsByTimestamp startingOffsets endingOffsets {code} These options currently exist when using structured streaming from a Kafka data source. *Please see comments from the below PR for details.* [#28841|[https://github.com/apache/spark/pull/28841]] *Example from usage with Kafka data source* [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32155) Provide options for offset-based semantics when using structured streaming from a file stream source
[ https://issues.apache.org/jira/browse/SPARK-32155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-32155: Description: Implement the following options while performing structured streaming from a file data source: {code:java} startingOffsetsByTimestamp endingOffsetsByTimestamp startingOffsets endingOffsets {code} These options currently exist when using structured streaming from a Kafka data source. *Please see comments from the below PR for details.* [https://github.com/apache/spark/pull/28841] *Example from usage with Kafka data source* [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] was: Implement the following options while performing structured streaming from a file data source: {code:java} startingOffsetsByTimestamp endingOffsetsByTimestamp startingOffsets endingOffsets {code} These options currently exist when using structured streaming from a Kafka data source. *Please see comments from the below PR for details.* [#28841|[https://github.com/apache/spark/pull/28841]] *Example from usage with Kafka data source* [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] > Provide options for offset-based semantics when using structured streaming > from a file stream source > > > Key: SPARK-32155 > URL: https://issues.apache.org/jira/browse/SPARK-32155 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > Implement the following options while performing structured streaming from a > file data source: > {code:java} > startingOffsetsByTimestamp > endingOffsetsByTimestamp > startingOffsets > endingOffsets > {code} > These options currently exist when using structured streaming from a Kafka > data source. > *Please see comments from the below PR for details.* > [https://github.com/apache/spark/pull/28841] > *Example from usage with Kafka data source* > > [http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
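For reference, the Kafka batch-query usage that the proposal mirrors, adapted from the linked documentation (the bootstrap servers, topic, and offsets are placeholders); the idea above is to offer analogous options on the file source:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-offset-options").getOrCreate()

// Batch query over a bounded offset range, as documented for the Kafka source.
// In the per-partition JSON, -2 means "earliest"; for a batch query, "latest" is
// only allowed as the ending offset.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("subscribe", "topic1")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2}}""")
  .option("endingOffsets", "latest")
  .load()
{code}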
[jira] [Assigned] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32154: Assignee: Apache Spark > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32154: Assignee: (was: Apache Spark) > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150132#comment-17150132 ] Apache Spark commented on SPARK-32154: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28979 > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32154) Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-32154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150133#comment-17150133 ] Apache Spark commented on SPARK-32154: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28979 > Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst > type > - > > Key: SPARK-32154 > URL: https://issues.apache.org/jira/browse/SPARK-32154 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Users now could register a UDF with Instant/LocalDate as return type even > with > spark.sql.datetime.java8API.enabled=false. However, the UDF can only be > really used with spark.sql.datetime.java8API.enabled=true. This could make > users confused. > The problem is we use ExpressionEncoder to ser/deser types when registering > the UDF, but use Catalyst converters to ser/deser types, which is under > control of spark.sql.datetime.java8API.enabled, when executing UDF. > If we could also use ExpressionEncoder to ser/deser types, similar to what we > do for input parameter types, the, UDF could support Instant/LocalDate, event > other combined complex types as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
JinxinTang created SPARK-32156: -- Summary: SPARK-31061 has two very similar tests could merge and somewhere could be improved Key: SPARK-32156 URL: https://issues.apache.org/jira/browse/SPARK-32156 Project: Spark Issue Type: Test Components: Tests Affects Versions: 3.0.0 Reporter: JinxinTang Fix For: 3.0.0 In {{org.apache.spark.sql.hive.HiveExternalCatalogSuite}} there are two very similar tests:
{code:scala}
test("SPARK-31061: alterTable should be able to change table provider") {
  val catalog = newBasicCatalog()
  val parquetTable = CatalogTable(
    identifier = TableIdentifier("parq_tbl", Some("db1")),
    tableType = CatalogTableType.MANAGED,
    storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))),
    schema = new StructType().add("col1", "int").add("col2", "string"),
    provider = Some("parquet"))
  catalog.createTable(parquetTable, ignoreIfExists = false)
  val rawTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(rawTable.provider === Some("parquet"))
  val fooTable = parquetTable.copy(provider = Some("foo"))  // <- `parquetTable` seems like it should be rawTable
  catalog.alterTable(fooTable)
  val alteredTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(alteredTable.provider === Some("foo"))
}

test("SPARK-31061: alterTable should be able to change table provider from hive") {
  val catalog = newBasicCatalog()
  val hiveTable = CatalogTable(
    identifier = TableIdentifier("parq_tbl", Some("db1")),
    tableType = CatalogTableType.MANAGED,
    storage = storageFormat,
    schema = new StructType().add("col1", "int").add("col2", "string"),
    provider = Some("hive"))
  catalog.createTable(hiveTable, ignoreIfExists = false)
  val rawTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(rawTable.provider === Some("hive"))
  val fooTable = rawTable.copy(provider = Some("foo"))
  catalog.alterTable(fooTable)
  val alteredTable = externalCatalog.getTable("db1", "parq_tbl")
  assert(alteredTable.provider === Some("foo"))
}
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32156: Assignee: Apache Spark > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32156: Assignee: (was: Apache Spark) > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150150#comment-17150150 ] Apache Spark commented on SPARK-32156: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32156) SPARK-31061 has two very similar tests could merge and somewhere could be improved
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150153#comment-17150153 ] Apache Spark commented on SPARK-32156: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > SPARK-31061 has two very similar tests could merge and somewhere could be > improved > -- > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150156#comment-17150156 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150154#comment-17150154 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150155#comment-17150155 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31061) Impossible to change the provider of a table in the HiveMetaStore
[ https://issues.apache.org/jira/browse/SPARK-31061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150157#comment-17150157 ] Apache Spark commented on SPARK-31061: -- User 'TJX2014' has created a pull request for this issue: https://github.com/apache/spark/pull/28980 > Impossible to change the provider of a table in the HiveMetaStore > - > > Key: SPARK-31061 > URL: https://issues.apache.org/jira/browse/SPARK-31061 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > Currently, it's impossible to alter the datasource of a table in the > HiveMetaStore by using alterTable, as the HiveExternalCatalog doesn't change > the provider table property during an alterTable command. This is required to > support changing table formats when using commands like REPLACE TABLE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32121: Assignee: Cheng Pan > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32121) ExternalShuffleBlockResolverSuite failed on Windows
[ https://issues.apache.org/jira/browse/SPARK-32121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32121. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 28940 [https://github.com/apache/spark/pull/28940] > ExternalShuffleBlockResolverSuite failed on Windows > --- > > Key: SPARK-32121 > URL: https://issues.apache.org/jira/browse/SPARK-32121 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.0.1 > Environment: Windows 10 >Reporter: Cheng Pan >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > The method {code}ExecutorDiskUtils.createNormalizedInternedPathname{code} > should consider the Windows file separator. > {code} > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.132 > s <<< FAILURE! - in > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite > [ERROR] > testNormalizeAndInternPathname(org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite) > Time elapsed: 0 s <<< FAILURE! > org.junit.ComparisonFailure: expected: but > was: > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.assertPathsMatch(ExternalShuffleBlockResolverSuite.java:160) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolverSuite.testNormalizeAndInternPathname(ExternalShuffleBlockResolverSuite.java:149) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
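Since the fix above is about separator handling, here is a tiny, purely illustrative sketch of the difference a test running on Windows has to account for; this is not the actual ExecutorDiskUtils.createNormalizedInternedPathname implementation, just the underlying platform behaviour:

{code:scala}
import java.io.File

// java.io.File joins and reports paths with the platform separator, so the same inputs
// yield "/foo/bar" on Linux but "C:\foo\bar" on Windows. A path normalizer, and any test
// asserting on its output, must not hard-code "/".
def normalizedPath(parent: String, child: String): String =
  new File(parent, child).getPath

// Linux:   normalizedPath("/foo", "bar")    == "/foo" + File.separator + "bar"
// Windows: normalizedPath("C:\\foo", "bar") == "C:\\foo" + File.separator + "bar"
{code}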
[jira] [Updated] (SPARK-32156) Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32156: - Summary: Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite (was: SPARK-31061 has two very similar tests could merge and somewhere could be improved) > Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite > > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150298#comment-17150298 ] Hyukjin Kwon commented on SPARK-25433: -- [~fhoering], I plan to redesign the PySpark documentation and I would like to put this in the documentation. Are you still active? I will cc on the related JIRAs if you are still interested in contributing the documentation. > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext but > zipping your local virtual env and then just changing PYSPARK_PYTHON env > variable should already work. > I also have seen this > [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each excecutor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. > Another problem with virtual env is that your local environment is not easily > shippable to another machine. In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package and when it is built you are sure it works. > This is in my opinion the most elegant way to ship python code (better than > virtual env and conda) > The problem why it doesn't work out of the box is that there can be only one > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > and runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31100) Detect namespace existence when setting namespace
[ https://issues.apache.org/jira/browse/SPARK-31100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31100. - Fix Version/s: 3.1.0 Assignee: Jackey Lee Resolution: Fixed > Detect namespace existence when setting namespace > - > > Key: SPARK-31100 > URL: https://issues.apache.org/jira/browse/SPARK-31100 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Assignee: Jackey Lee >Priority: Major > Fix For: 3.1.0 > > > We should check if the namespace exists while calling "use namespace", and > throw NoSuchNamespaceException if the namespace does not exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
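A rough sketch of the requested behavior against the DataSourceV2 catalog API, for illustration only (the method below is hypothetical; the real change lives in the command that sets the current namespace and throws NoSuchNamespaceException rather than the generic error used here):
{code:scala}
import org.apache.spark.sql.connector.catalog.SupportsNamespaces

// Illustrative only: validate that the namespace exists before switching to it.
def useNamespace(catalog: SupportsNamespaces, ns: Array[String]): Unit = {
  if (!catalog.namespaceExists(ns)) {
    // The actual fix throws NoSuchNamespaceException here.
    throw new IllegalArgumentException(s"Namespace '${ns.mkString(".")}' not found")
  }
  // ... record ns as the session's current namespace
}
{code}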
[jira] [Created] (SPARK-32157) Integer overflow when constructing large query plan string
Tanel Kiis created SPARK-32157: -- Summary: Integer overflow when constructing large query plan string Key: SPARK-32157 URL: https://issues.apache.org/jira/browse/SPARK-32157 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Tanel Kiis When the length of the string representation of the query plan in org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat goes above Integer.MAX_VALUE, then the query can end with either of these two exception: "spark.sql.maxPlanStringLength" was set to 0: {noformat} java.lang.NegativeArraySizeException at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) at java.lang.StringBuilder.(StringBuilder.java:101) at org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.toString(StringUtils.scala:136) at org.apache.spark.sql.catalyst.util.StringUtils$PlanStringConcat.toString(StringUtils.scala:163) at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:208) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:95) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829) {noformat} "spark.sql.maxPlanStringLength" was at the default value: {noformat} java.lang.StringIndexOutOfBoundsException: String index out of range: -47 at java.lang.String.substring(String.java:1967) at org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.append(StringUtils.scala:123) at org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1(QueryExecution.scala:207) at org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1$adapted(QueryExecution.scala:207) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1(TreeNode.scala:663) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1$adapted(TreeNode.scala:662) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:662) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at 
org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$3(TreeNode.scala:693) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$3$adapted(TreeNode.scala:691) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeN
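The underlying failure mode is ordinary Int overflow in the accumulated length counter, which is easy to see in isolation (the numbers below are illustrative):
{code:scala}
// Int arithmetic wraps around silently, which is what turns the accumulated
// plan-string length negative once it passes Integer.MAX_VALUE and then feeds
// negative sizes/offsets into StringBuilder and String.substring, consistent
// with the two exceptions shown above.
val accumulated: Int = Int.MaxValue - 10   // length tracked so far
val nextFragment: Int = 100                // length of the next appended plan fragment
val newLength = accumulated + nextFragment // wraps to a negative value
println(newLength)                         // prints -2147483559
assert(newLength < 0)
{code}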
[jira] [Commented] (SPARK-30132) Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses
[ https://issues.apache.org/jira/browse/SPARK-30132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150387#comment-17150387 ] Dongjoon Hyun commented on SPARK-30132: --- Nice! Thanks! > Scala 2.13 compile errors from Hadoop LocalFileSystem subclasses > > > Key: SPARK-30132 > URL: https://issues.apache.org/jira/browse/SPARK-30132 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Priority: Minor > > A few classes in our test code extend Hadoop's LocalFileSystem. Scala 2.13 > returns a compile error here - not for the Spark code, but because the Hadoop > code (it says) illegally overrides appendFile() with slightly different > generic types in its return value. This code is valid Java, evidently, and > the code actually doesn't define any generic types, so, I even wonder if it's > a scalac bug. > So far I don't see a workaround for this. > This only affects the Hadoop 3.2 build, in that it comes up with respect to a > method new in Hadoop 3. (There is actually another instance of a similar > problem that affects Hadoop 2, but I can see a tiny hack workaround for it). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32157) Integer overflow when constructing large query plan string
[ https://issues.apache.org/jira/browse/SPARK-32157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis resolved SPARK-32157. Resolution: Duplicate > Integer overflow when constructing large query plan string > --- > > Key: SPARK-32157 > URL: https://issues.apache.org/jira/browse/SPARK-32157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Tanel Kiis >Priority: Major > > When the length of the string representation of the query plan in > org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat goes above > Integer.MAX_VALUE, then the query can end with either of these two exception: > "spark.sql.maxPlanStringLength" was set to 0: > {noformat} > java.lang.NegativeArraySizeException > at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) > at java.lang.StringBuilder.(StringBuilder.java:101) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.toString(StringUtils.scala:136) > at > org.apache.spark.sql.catalyst.util.StringUtils$PlanStringConcat.toString(StringUtils.scala:163) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:208) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:95) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829) > {noformat} > "spark.sql.maxPlanStringLength" was at the default value: > {noformat} > java.lang.StringIndexOutOfBoundsException: String index out of range: -47 > at java.lang.String.substring(String.java:1967) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.append(StringUtils.scala:123) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1(QueryExecution.scala:207) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1$adapted(QueryExecution.scala:207) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1(TreeNode.scala:663) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1$adapted(TreeNode.scala:662) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:662) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.Tree
[jira] [Closed] (SPARK-32157) Integer overflow when constructing large query plan string
[ https://issues.apache.org/jira/browse/SPARK-32157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanel Kiis closed SPARK-32157. -- > Integer overflow when constructing large query plan string > --- > > Key: SPARK-32157 > URL: https://issues.apache.org/jira/browse/SPARK-32157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Tanel Kiis >Priority: Major > > When the length of the string representation of the query plan in > org.apache.spark.sql.catalyst.util.StringUtils.PlanStringConcat goes above > Integer.MAX_VALUE, then the query can end with either of these two exception: > "spark.sql.maxPlanStringLength" was set to 0: > {noformat} > java.lang.NegativeArraySizeException > at java.lang.AbstractStringBuilder.(AbstractStringBuilder.java:68) > at java.lang.StringBuilder.(StringBuilder.java:101) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.toString(StringUtils.scala:136) > at > org.apache.spark.sql.catalyst.util.StringUtils$PlanStringConcat.toString(StringUtils.scala:163) > at > org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:208) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:95) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:944) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:396) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:380) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:269) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:829) > {noformat} > "spark.sql.maxPlanStringLength" was at the default value: > {noformat} > java.lang.StringIndexOutOfBoundsException: String index out of range: -47 > at java.lang.String.substring(String.java:1967) > at > org.apache.spark.sql.catalyst.util.StringUtils$StringConcat.append(StringUtils.scala:123) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1(QueryExecution.scala:207) > at > org.apache.spark.sql.execution.QueryExecution.$anonfun$toString$1$adapted(QueryExecution.scala:207) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1(TreeNode.scala:663) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeString$1$adapted(TreeNode.scala:662) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:662) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:795) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.execution.InputAdapter.generateTreeString(WholeStageCodegenExec.scala:550) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:697) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$generateTreeStri
[jira] [Resolved] (SPARK-32156) Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-32156. --- Fix Version/s: (was: 3.0.0) 3.1.0 Resolution: Fixed Issue resolved by pull request 28980 [https://github.com/apache/spark/pull/28980] > Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite > > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.1.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32156) Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite
[ https://issues.apache.org/jira/browse/SPARK-32156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-32156: - Assignee: JinxinTang > Refactor two similar test cases from SPARK-31061 in HiveExternalCatalogSuite > > > Key: SPARK-32156 > URL: https://issues.apache.org/jira/browse/SPARK-32156 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.0.0 >Reporter: JinxinTang >Assignee: JinxinTang >Priority: Major > Fix For: 3.0.0 > > > In `org.apache.spark.sql.hive.HiveExternalCatalogSuite` > ` > test("SPARK-31061: alterTable should be able to change table provider") { > val catalog = newBasicCatalog() > val parquetTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat.copy(locationUri = Some(new URI("file:/some/path"))), > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("parquet")) > catalog.createTable(parquetTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("parquet")) > val fooTable = *parquetTable*.copy(provider = Some("foo")) <- > `*parquetTable*` seems should be rawTable > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > test("SPARK-31061: alterTable should be able to change table provider from > hive") { > val catalog = newBasicCatalog() > val hiveTable = CatalogTable( > identifier = TableIdentifier("parq_tbl", Some("db1")), > tableType = CatalogTableType.MANAGED, > storage = storageFormat, > schema = new StructType().add("col1", "int").add("col2", "string"), > provider = Some("hive")) > catalog.createTable(hiveTable, ignoreIfExists = false) > val rawTable = externalCatalog.getTable("db1", "parq_tbl") > assert(rawTable.provider === Some("hive")) > val fooTable = rawTable.copy(provider = Some("foo")) > catalog.alterTable(fooTable) > val alteredTable = externalCatalog.getTable("db1", "parq_tbl") > assert(alteredTable.provider === Some("foo")) > } > ` > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32158) Add JSONOptions to toJSON
German Schiavon Matteo created SPARK-32158: -- Summary: Add JSONOptions to toJSON Key: SPARK-32158 URL: https://issues.apache.org/jira/browse/SPARK-32158 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: German Schiavon Matteo Fix For: 3.0.1, 3.1.0 Currently, when calling `toJSON` on a DataFrame with null values, it doesn't print them. Basically the same idea as https://issues.apache.org/jira/browse/SPARK-23772. {code:scala} val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} After the PR: {code:scala} val result = df.toJSON(Map("ignoreNullFields" -> "false")).collect().mkString(",") val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" {code} [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
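Until `toJSON` accepts options, a similar result can be had by building the JSON column explicitly with `to_json`, which already takes options; a sketch reusing the `df` from the example above:
{code:scala}
import org.apache.spark.sql.functions.{col, struct, to_json}

// Workaround sketch: to_json already honors JSON options such as ignoreNullFields.
val json = df.select(
  to_json(struct(col("col1")), Map("ignoreNullFields" -> "false")).as("value"))
// json.collect() => {"col1":"1"}, {"col1":"2"}, {"col1":null}
{code}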
[jira] [Created] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
Erik Erlandson created SPARK-32159: -- Summary: New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization Key: SPARK-32159 URL: https://issues.apache.org/jira/browse/SPARK-32159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Erik Erlandson The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{ /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } }} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{ object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
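For reference, a minimal reproducer sketch of the array-input case described above (the aggregator, its semantics, and the column name are illustrative):
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, functions}
import org.apache.spark.sql.expressions.Aggregator

// Illustrative aggregator whose input type is an array rather than an atomic type.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(buf: Double, in: Array[Double]): Double = buf + in.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(buf: Double): Double = buf
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Registering the aggregator is fine; applying it to an Array[Double] column is
// what exercises the UnresolvedMapObjects path that fails on the executors.
val arraySum = functions.udaf(ArraySum)
// df.select(arraySum(functions.col("values")))  // hypothetical DataFrame with an Array[Double] column
{code}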
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Description: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{/** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } }}} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be was: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{ /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. 
*/ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } }} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{ object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Sp
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150490#comment-17150490 ] Erik Erlandson commented on SPARK-32159: cc [~cloud_fan] > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > {{/** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > @transient function: Expression => Expression, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression > with Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse { > throw new UnsupportedOperationException("not resolved") > } > }}} > The '@transient' is causing the function to be unpacked as 'null' over on the > executors, and it is causing a null-pointer exception here, when it tries to > do 'function(loopVar)' > {{object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = { > val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) > } > } > }} > I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Description: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( {color:#de350b}@transient function: Expression => Expression{color}, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)'* object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, customCollectionCls) } } *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be* was: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. 
*/ case class UnresolvedMapObjects( {color:#de350b}@transient function: Expression => Expression{color}, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)'* {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, customCollectionCls) } } }} *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be* > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark >
[jira] [Updated] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Erlandson updated SPARK-32159: --- Description: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: /** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. */ case class UnresolvedMapObjects( {color:#de350b}@transient function: Expression => Expression{color}, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } } *The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)'* {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, customCollectionCls) } } }} *I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be* was: The new user defined aggregator feature (SPARK-27296) based on calling 'functions.udaf(aggregator)' works fine when the aggregator input type is atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an array, like 'Aggregator[Array[Double], _, _]', it is tripping over the following: {{/** * When constructing [[MapObjects]], the element type must be given, which may not be available * before analysis. This class acts like a placeholder for [[MapObjects]], and will be replaced by * [[MapObjects]] during analysis after the input data is resolved. * Note that, ideally we should not serialize and send unresolved expressions to executors, but * users may accidentally do this(e.g. mistakenly reference an encoder instance when implementing * Aggregator). Here we mark `function` as transient because it may reference scala Type, which is * not serializable. Then even users mistakenly reference unresolved expression and serialize it, * it's just a performance issue(more network traffic), and will not fail. 
*/ case class UnresolvedMapObjects( @transient function: Expression => Expression, child: Expression, customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with Unevaluable { override lazy val resolved = false override def dataType: DataType = customCollectionCls.map(ObjectType.apply).getOrElse { throw new UnsupportedOperationException("not resolved") } }}} The '@transient' is causing the function to be unpacked as 'null' over on the executors, and it is causing a null-pointer exception here, when it tries to do 'function(loopVar)' {{object MapObjects { def apply( function: Expression => Expression, inputData: Expression, elementType: DataType, elementNullable: Boolean = true, customCollectionCls: Option[Class[_]] = None): MapObjects = { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) MapObjects(loopVar, function(loopVar), inputData, customCollectionCls) } } }} I believe it may be possible to just use 'loopVar' instead of 'function(loopVar)', whenever 'function' is null, but need second opinion from catalyst developers on what a robust fix should be > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150523#comment-17150523 ] Sudharshann D. commented on SPARK-31579: Hey [~maxgekk]. friendly ping once again! > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150535#comment-17150535 ] Maxim Gekk commented on SPARK-31579: [~suddhuASF] Please, open a PR for master. > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150565#comment-17150565 ] Apache Spark commented on SPARK-32130: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28981 > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Assignee: Maxim Gekk >Priority: Critical > Fix For: 3.0.1, 3.1.0 > > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First > Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPer
[jira] [Commented] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-32130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150566#comment-17150566 ] Apache Spark commented on SPARK-32130: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28981 > Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4 > -- > > Key: SPARK-32130 > URL: https://issues.apache.org/jira/browse/SPARK-32130 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 > Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, > sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using > 10.0.0.8 instead (on interface en0) > 20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > 20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. > Attempting port 4041. > Spark context Web UI available at http://10.0.0.8:4041 > Spark context available as 'sc' (master = local[*], app id = > local-1593442346864). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java > 1.8.0_251) > Type in expressions to have them evaluated. > Type :help for more information. >Reporter: Sanjeev Mishra >Assignee: Maxim Gekk >Priority: Critical > Fix For: 3.0.1, 3.1.0 > > Attachments: SPARK 32130 - replication and findings.ipynb, > small-anon.tar > > > We are planning to move to Spark 3 but the read performance of our json files > is unacceptable. Following is the performance numbers when compared to Spark > 2.4 > > Spark 2.4 > scala> spark.time(spark.read.json("/data/20200528")) > Time taken: {color:#ff}19691 ms{color} > res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res61.count()) > Time taken: {color:#ff}7113 ms{color} > res64: Long = 2605349 > Spark 3.0 > scala> spark.time(spark.read.json("/data/20200528")) > 20/06/29 08:06:53 WARN package: Truncated the string representation of a > plan since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > Time taken: {color:#ff}849652 ms{color} > res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 > more fields] > scala> spark.time(res0.count()) > Time taken: {color:#ff}8201 ms{color} > res2: Long = 2605349 > > > I am attaching a sample data (please delete is once you are able to > reproduce the issue) that is much smaller than the actual size but the > performance comparison can still be verified. 
> The sample tar contains bunch of json.gz files, each line of the file is self > contained json doc as shown below > To reproduce the issue please untar the attachment - it will have multiple > .json.gz files whose contents will look similar to following > > {quote}{color:#ff}{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS > HNC IGD","Annex F > Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports > Arris FastPath Speed > Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS > HNC IGD > EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service > Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First > Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPer
[jira] [Assigned] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32159: Assignee: Apache Spark > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Assignee: Apache Spark >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150590#comment-17150590 ] Apache Spark commented on SPARK-32159: -- User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/28983 > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32159: Assignee: (was: Apache Spark) > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150591#comment-17150591 ] Apache Spark commented on SPARK-32159: -- User 'erikerlandson' has created a pull request for this issue: https://github.com/apache/spark/pull/28983 > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32159) New udaf(Aggregator) has an integration bug with UnresolvedMapObjects serialization
[ https://issues.apache.org/jira/browse/SPARK-32159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150593#comment-17150593 ] Dongjoon Hyun commented on SPARK-32159: --- Hi, [~eje]. Shall we set `Target Version` to `3.0.1`? > New udaf(Aggregator) has an integration bug with UnresolvedMapObjects > serialization > --- > > Key: SPARK-32159 > URL: https://issues.apache.org/jira/browse/SPARK-32159 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Erik Erlandson >Priority: Major > > The new user defined aggregator feature (SPARK-27296) based on calling > 'functions.udaf(aggregator)' works fine when the aggregator input type is > atomic, e.g. 'Aggregator[Double, _, _]', however if the input type is an > array, like 'Aggregator[Array[Double], _, _]', it is tripping over the > following: > /** > * When constructing [[MapObjects]], the element type must be given, which > may not be available > * before analysis. This class acts like a placeholder for [[MapObjects]], > and will be replaced by > * [[MapObjects]] during analysis after the input data is resolved. > * Note that, ideally we should not serialize and send unresolved expressions > to executors, but > * users may accidentally do this(e.g. mistakenly reference an encoder > instance when implementing > * Aggregator). Here we mark `function` as transient because it may reference > scala Type, which is > * not serializable. Then even users mistakenly reference unresolved > expression and serialize it, > * it's just a performance issue(more network traffic), and will not fail. > */ > case class UnresolvedMapObjects( > {color:#de350b}@transient function: Expression => Expression{color}, > child: Expression, > customCollectionCls: Option[Class[_]] = None) extends UnaryExpression with > Unevaluable { > override lazy val resolved = false > override def dataType: DataType = > customCollectionCls.map(ObjectType.apply).getOrElse > { throw new UnsupportedOperationException("not resolved") } > } > > *The '@transient' is causing the function to be unpacked as 'null' over on > the executors, and it is causing a null-pointer exception here, when it tries > to do 'function(loopVar)'* > object MapObjects { > def apply( > function: Expression => Expression, > inputData: Expression, > elementType: DataType, > elementNullable: Boolean = true, > customCollectionCls: Option[Class[_]] = None): MapObjects = > { val loopVar = LambdaVariable("MapObject", elementType, elementNullable) > MapObjects(loopVar, {color:#de350b}function(loopVar){color}, inputData, > customCollectionCls) } > } > *I believe it may be possible to just use 'loopVar' instead of > 'function(loopVar)', whenever 'function' is null, but need second opinion > from catalyst developers on what a robust fix should be* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150596#comment-17150596 ] Dongjoon Hyun commented on SPARK-31666: --- Apache Spark 3.0.0 is released last month and Apache Spark 3.1.0 is scheduled on December 2020. - https://spark.apache.org/versioning-policy.html > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150598#comment-17150598 ] Apache Spark commented on SPARK-32158: -- User 'Gschiavon' has created a pull request for this issue: https://github.com/apache/spark/pull/28984 > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32158: Assignee: Apache Spark > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Assignee: Apache Spark >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32158: Assignee: (was: Apache Spark) > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32158) Add JSONOptions to toJSON
[ https://issues.apache.org/jira/browse/SPARK-32158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] German Schiavon Matteo updated SPARK-32158: --- Description: Actually when calling `toJSON` on a dataFrame with null values, it doesn't print them. Basically the same idea than https://issues.apache.org/jira/browse/SPARK-23772. {code:java} val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} After the PR: {code:java} val result = df.toJSON(Map("ignoreNullFields" -> "false")).collect().mkString(",") val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" {code} [~maropu] [~ueshin] [https://github.com/apache/spark/pull/28984/] was: Actually when calling `toJSON` on a dataFrame with null values, it doesn't print them. Basically the same idea than https://issues.apache.org/jira/browse/SPARK-23772. {code:java} val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} After the PR: {code:java} val result = df.toJSON(Map("ignoreNullFields" -> "false")).collect().mkString(",") val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" {code} [~maropu] > Add JSONOptions to toJSON > - > > Key: SPARK-32158 > URL: https://issues.apache.org/jira/browse/SPARK-32158 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: German Schiavon Matteo >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > Actually when calling `toJSON` on a dataFrame with null values, it doesn't > print them. > Basically the same idea than > https://issues.apache.org/jira/browse/SPARK-23772. > > {code:java} > val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1") > df.toJSON -> {"col1":"1"},{"col1":"2"},{}{code} > > After the PR: > {code:java} > val result = df.toJSON(Map("ignoreNullFields" -> > "false")).collect().mkString(",") > val expected = """{"col1":"1"},{"col1":"2"},{"col1":null}""" > {code} > [~maropu] [~ueshin] > > [https://github.com/apache/spark/pull/28984/] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
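For anyone trying the snippet above in a fresh application, a self-contained version of the current behaviour looks like this (local master assumed purely for illustration; the option-taking toJSON overload only exists once the linked PR lands):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("SPARK-32158-repro").getOrCreate()
import spark.implicits._

// Current behaviour: null-valued fields are silently dropped from the generated JSON.
val df = spark.sparkContext.parallelize(Seq("1", "2", null)).toDF("col1")
println(df.toJSON.collect().mkString(","))   // {"col1":"1"},{"col1":"2"},{}
{code}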
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150599#comment-17150599 ] Dongjoon Hyun commented on SPARK-31666: --- FYI, Apache Spark 2.4.0 was released at November 2, 2018. It's already over 18 months. Apache Spark community wants to service the users a little longer with critical fixes like security and correctness issues. As a result, Apache Spark 2.4.7 will be released soon again. {quote}Feature release branches will, generally, be maintained with bug fix releases for a period of 18 months {quote} > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150601#comment-17150601 ] Dongjoon Hyun commented on SPARK-31666: --- I linked SPARK-23529 since `hostPath` is added there. > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150601#comment-17150601 ] Dongjoon Hyun edited comment on SPARK-31666 at 7/2/20, 9:50 PM: I linked SPARK-23529 since `hostPath` is added there at 2.4.0. was (Author: dongjoon): I linked SPARK-23529 since `hostPath` is added there. > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25262) Support tmpfs for local dirs in k8s
[ https://issues.apache.org/jira/browse/SPARK-25262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150605#comment-17150605 ] Apache Spark commented on SPARK-25262: -- User 'hopper-signifyd' has created a pull request for this issue: https://github.com/apache/spark/pull/28985 > Support tmpfs for local dirs in k8s > --- > > Key: SPARK-25262 > URL: https://issues.apache.org/jira/browse/SPARK-25262 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 2.3.0, 2.3.1 >Reporter: Rob Vesse >Assignee: Rob Vesse >Priority: Major > Fix For: 3.0.0 > > > As discussed during review of the design document for SPARK-24434 while > providing pod templates will provide more in-depth customisation for Spark on > Kubernetes there are some things that cannot be modified because Spark code > generates pod specs in very specific ways. > The particular issue identified relates to handling on {{spark.local.dirs}} > which is done by {{LocalDirsFeatureStep.scala}}. For each directory > specified, or a single default if no explicit specification, it creates a > Kubernetes {{emptyDir}} volume. As noted in the Kubernetes documentation > this will be backed by the node storage > (https://kubernetes.io/docs/concepts/storage/volumes/#emptydir). In some > compute environments this may be extremely undesirable. For example with > diskless compute resources the node storage will likely be a non-performant > remote mounted disk, often with limited capacity. For such environments it > would likely be better to set {{medium: Memory}} on the volume per the K8S > documentation to use a {{tmpfs}} volume instead. > Another closely related issue is that users might want to use a different > volume type to back the local directories and there is no possibility to do > that. > Pod templates will not really solve either of these issues because Spark is > always going to attempt to generate a new volume for each local directory and > always going to set these as {{emptyDir}}. > Therefore the proposal is to make two changes to {{LocalDirsFeatureStep}}: > * Provide a new config setting to enable using {{tmpfs}} backed {{emptyDir}} > volumes > * Modify the logic to check if there is a volume already defined with the > name and if so skip generating a volume definition for it -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
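As a usage note, the knob this ticket introduced can be set like any other Spark property. The key below is quoted from memory of the running-on-kubernetes documentation, so treat it as an assumption and verify it against your Spark version:
{code:scala}
import org.apache.spark.SparkConf

// Assumed configuration key for tmpfs-backed local dirs on Kubernetes (verify before relying on it).
val conf = new SparkConf()
  .set("spark.kubernetes.local.dirs.tmpfs", "true")  // back the spark.local.dir emptyDir volumes with medium=Memory
{code}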
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150621#comment-17150621 ] Dongjoon Hyun commented on SPARK-31666: --- Hi, [~hopper-signifyd]. I found what is going on there. SPARK-23529 works correctly like the following. {code} # minikube ssh ls /data SPARK-31666.txt {code} {code} export HTTP2_DISABLE=true bin/spark-submit \ --master k8s://$K8S_MASTER \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \ --conf spark.executor.instances=1 \ --conf spark.kubernetes.container.image=spark/spark:v2.4.6 \ --conf spark.kubernetes.executor.volumes.hostPath.data.mount.path=/data \ --conf spark.kubernetes.executor.volumes.hostPath.data.options.path=/data \ local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar 1 {code} {code} # k exec po/spark-pi-1593729363998-exec-1 -- ls /data SPARK-31666.txt {code} Please see the error message `Invalid value: "/tmp1": must be unique.`. The error message occurs because `spark-local-dir-x` is already mounted as volume name by Spark. You should not use the same name. {code} 20/07/02 15:38:39 INFO LoggingPodStatusWatcherImpl: State changed, new state: pod name: spark-pi-1593729518015-driver namespace: default labels: spark-app-selector -> spark-74b65a9a61cc46fd8bfc5e03e4b28bb8, spark-role -> driver pod uid: d838532b-eaa9-4b11-8eba-655f66965580 creation time: 2020-07-02T22:38:39Z service account name: default volumes: spark-local-dir-1, spark-conf-volume, default-token-n5wwg node name: N/A start time: N/A container images: N/A phase: Pending status: [] {code} > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. 
Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150622#comment-17150622 ] Dongjoon Hyun commented on SPARK-31666: --- So, "Cannot map hostPath volumes to container" is a wrong claim. It's a fair warning from K8s to prevent duplicated volume names. I'll close this issue. > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
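To make the diagnosis above concrete, these are the two parts of the reported submission that end up requesting the same container mount path: Spark already mounts its own local-dir volume at spark.local.dir, so the user-supplied hostPath mount at the identical path is rejected as non-unique. Shown as SparkConf settings for readability; this simply mirrors the --conf flags in the report.
{code:scala}
import org.apache.spark.SparkConf

// The same /tmp1 path is requested twice: once implicitly through spark.local.dir
// (Spark's own spark-local-dir-1 volume) and once through the explicit hostPath volume.
val conf = new SparkConf()
  .set("spark.local.dir", "/tmp1")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path", "/tmp1")
  .set("spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path", "/tmp1")
// Giving the hostPath volume its own mount path (or dropping the explicit spark.local.dir)
// avoids the "must be unique" rejection, as in the working example shown earlier in the thread.
{code}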
[jira] [Resolved] (SPARK-31666) Cannot map hostPath volumes to container
[ https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31666. --- Resolution: Not A Problem > Cannot map hostPath volumes to container > > > Key: SPARK-31666 > URL: https://issues.apache.org/jira/browse/SPARK-31666 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.5 >Reporter: Stephen Hopper >Priority: Major > > I'm trying to mount additional hostPath directories as seen in a couple of > places: > [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/] > [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space] > [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes] > > However, whenever I try to submit my job, I run into this error: > {code:java} > Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │ > io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: > POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. > Message: Pod "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique. Received status: Status(apiVersion=v1, code=422, > details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath, > message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, > additionalProperties={})], group=null, kind=Pod, > name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, > additionalProperties={}), kind=Status, message=Pod > "spark-pi-1588970477877-exec-1" is invalid: > spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be > unique, metadata=ListMeta(_continue=null, remainingItemCount=null, > resourceVersion=null, selfLink=null, additionalProperties={}), > reason=Invalid, status=Failure, additionalProperties={}).{code} > > This is my spark-submit command (note: I've used my own build of spark for > kubernetes as well as a few other images that I've seen floating around (such > as this one seedjeffwan/spark:v2.4.5) and they all have this same issue): > {code:java} > bin/spark-submit \ > --master k8s://https://my-k8s-server:443 \ > --deploy-mode cluster \ > --name spark-pi \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.executor.instances=2 \ > --conf spark.kubernetes.container.image=my-spark-image:my-tag \ > --conf spark.kubernetes.driver.pod.name=sparkpi-test-driver \ > --conf spark.kubernetes.namespace=my-spark-ns \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.mount.path=/tmp1 > \ > --conf > spark.kubernetes.executor.volumes.hostPath.spark-local-dir-2.options.path=/tmp1 > \ > --conf spark.local.dir="/tmp1" \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark > local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 2{code} > Any ideas on what's causing this? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32160) Executors should not be able to create SparkContext.
Takuya Ueshin created SPARK-32160: - Summary: Executors should not be able to create SparkContext. Key: SPARK-32160 URL: https://issues.apache.org/jira/browse/SPARK-32160 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin Currently executors can create SparkContext, but shouldn't be able to create it. {code:scala} sc.range(0, 1).foreach { _ => new SparkContext(new SparkConf().setAppName("test").setMaster("local")) } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
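One plausible shape for the fix (a sketch only, not necessarily what the eventual patch does) is a driver-side assertion based on TaskContext, which is defined only while a task is running on an executor:
{code:scala}
import org.apache.spark.{SparkException, TaskContext}

// Sketch of a guard that a SparkContext constructor could call; the real fix may differ.
def assertOnDriver(): Unit = {
  if (TaskContext.get() != null) {
    // A non-null TaskContext means we are inside a running task, i.e. on an executor.
    throw new SparkException("SparkContext should only be created and accessed on the driver.")
  }
}
{code}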
[jira] [Assigned] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32160: Assignee: Apache Spark > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32160: Assignee: (was: Apache Spark) > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150633#comment-17150633 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/28986 > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32160) Executors should not be able to create SparkContext.
[ https://issues.apache.org/jira/browse/SPARK-32160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150634#comment-17150634 ] Apache Spark commented on SPARK-32160: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/28986 > Executors should not be able to create SparkContext. > > > Key: SPARK-32160 > URL: https://issues.apache.org/jira/browse/SPARK-32160 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > > Currently executors can create SparkContext, but shouldn't be able to create > it. > {code:scala} > sc.range(0, 1).foreach { _ => > new SparkContext(new SparkConf().setAppName("test").setMaster("local")) > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32161) Hide JVM traceback for SparkUpgradeException
Hyukjin Kwon created SPARK-32161: Summary: Hide JVM traceback for SparkUpgradeException Key: SPARK-32161 URL: https://issues.apache.org/jira/browse/SPARK-32161 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon We added {{SparkUpgradeException}}, for which the JVM traceback is pretty useless. See also https://github.com/apache/spark/pull/28736/files#r449184881 PySpark should also whitelist this exception and hide the JVM traceback. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32162) Improve Pandas Grouped Map with Window test output
Bryan Cutler created SPARK-32162: Summary: Improve Pandas Grouped Map with Window test output Key: SPARK-32162 URL: https://issues.apache.org/jira/browse/SPARK-32162 Project: Spark Issue Type: Improvement Components: PySpark, Tests Affects Versions: 3.0.0 Reporter: Bryan Cutler The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is not helpful, only gives {code} == FAIL: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) -- Traceback (most recent call last): File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 588, in test_grouped_over_window_with_key self.assertTrue(all([r[0] for r in result])) AssertionError: False is not true -- Ran 21 tests in 141.194s FAILED (failures=1) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32162) Improve Pandas Grouped Map with Window test output
[ https://issues.apache.org/jira/browse/SPARK-32162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32162: Assignee: (was: Apache Spark) > Improve Pandas Grouped Map with Window test output > -- > > Key: SPARK-32162 > URL: https://issues.apache.org/jira/browse/SPARK-32162 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Minor > > The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is > not helpful, only gives > {code} > == > FAIL: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 588, in test_grouped_over_window_with_key > self.assertTrue(all([r[0] for r in result])) > AssertionError: False is not true > -- > Ran 21 tests in 141.194s > FAILED (failures=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32162) Improve Pandas Grouped Map with Window test output
[ https://issues.apache.org/jira/browse/SPARK-32162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32162: Assignee: Apache Spark > Improve Pandas Grouped Map with Window test output > -- > > Key: SPARK-32162 > URL: https://issues.apache.org/jira/browse/SPARK-32162 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > > The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is > not helpful, only gives > {code} > == > FAIL: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 588, in test_grouped_over_window_with_key > self.assertTrue(all([r[0] for r in result])) > AssertionError: False is not true > -- > Ran 21 tests in 141.194s > FAILED (failures=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32162) Improve Pandas Grouped Map with Window test output
[ https://issues.apache.org/jira/browse/SPARK-32162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150673#comment-17150673 ] Apache Spark commented on SPARK-32162: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/28987 > Improve Pandas Grouped Map with Window test output > -- > > Key: SPARK-32162 > URL: https://issues.apache.org/jira/browse/SPARK-32162 > Project: Spark > Issue Type: Improvement > Components: PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Minor > > The output of GroupedMapInPandasTests.test_grouped_over_window_with_key is > not helpful, only gives > {code} > == > FAIL: test_grouped_over_window_with_key > (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests) > -- > Traceback (most recent call last): > File "/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line > 588, in test_grouped_over_window_with_key > self.assertTrue(all([r[0] for r in result])) > AssertionError: False is not true > -- > Ran 21 tests in 141.194s > FAILED (failures=1) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
L. C. Hsieh created SPARK-32163: --- Summary: Nested pruning should still work for nested column extractors of attributes with cosmetic variations Key: SPARK-32163 URL: https://issues.apache.org/jira/browse/SPARK-32163 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh If the expressions extracting nested fields have cosmetic variations such as qualifier differences, nested column pruning currently does not work well. For example, two attributes that are semantically the same may be referenced in a query, but their nested column extractors are treated differently when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150678#comment-17150678 ] Apache Spark commented on SPARK-32163: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28988 > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32163: Assignee: L. C. Hsieh (was: Apache Spark) > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32163: Assignee: Apache Spark (was: L. C. Hsieh) > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32163) Nested pruning should still work for nested column extractors of attributes with cosmetic variations
[ https://issues.apache.org/jira/browse/SPARK-32163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-32163: Issue Type: Bug (was: Improvement) > Nested pruning should still work for nested column extractors of attributes > with cosmetic variations > > > Key: SPARK-32163 > URL: https://issues.apache.org/jira/browse/SPARK-32163 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > If the expressions extracting nested fields have cosmetic variations like > qualifier difference, currently nested column pruning cannot work well. > For example, two attributes which are semantically the same, are referred in > a query, but the nested column extractors of them are treated differently > when we deal with nested column pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
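For readers unfamiliar with the feature, here is a PySpark sketch (paths, schema, and column names are made up) of one way to observe whether nested column pruning applied, by inspecting the ReadSchema reported in the physical plan. It is not claimed to reproduce the exact miss described in this ticket; it only illustrates referencing the same nested field through differently qualified aliases.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("nested-pruning-check").getOrCreate()

# Nested schema where only meta.id should be needed downstream.
df = spark.createDataFrame(
    [(1, ("a", "big payload")), (2, ("b", "another payload"))],
    "key INT, meta STRUCT<id: STRING, payload: STRING>")
df.write.mode("overwrite").parquet("/tmp/nested_pruning_demo")  # hypothetical path

src = spark.read.parquet("/tmp/nested_pruning_demo")

# The same relation referenced under two different qualifiers ("l" and "r")
# is the kind of cosmetic variation the report talks about.
joined = (src.alias("l")
          .join(src.alias("r"), F.col("l.key") == F.col("r.key"))
          .select(F.col("l.meta").getField("id").alias("id")))

# If pruning applied, ReadSchema in the scan nodes should mention only meta.id
# rather than the full meta struct including its payload field.
joined.explain()
{code}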
[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150688#comment-17150688 ] Apache Spark commented on SPARK-27194: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/28989 > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
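For context, a minimal PySpark sketch of the write pattern involved (dynamic partition overwrite into a partitioned table, which routes task output through the .spark-staging directory mentioned above). The table and column names are illustrative, not taken from the report.
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[2]")
         .appName("dynamic-overwrite-demo")
         # The mode the reporter says is in play when the failure shows up.
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

# Hypothetical partitioned target; with "dynamic" mode only the partitions
# present in the incoming data are replaced, and each task writes its file
# under a hidden .spark-staging-<id> directory before commit.
spark.sql("""
  CREATE TABLE IF NOT EXISTS supply_tmp (value INT, dt STRING)
  USING PARQUET
  PARTITIONED BY (dt)
""")

df = spark.createDataFrame([(1, "2019-02-17"), (2, "2019-02-18")], "value INT, dt STRING")

# insertInto resolves columns by position, so keep the (value, dt) ordering.
df.write.mode("overwrite").insertInto("supply_tmp")
{code}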
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150689#comment-17150689 ] Apache Spark commented on SPARK-29302: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/28989 > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is deterministic. > So, if speculation is enabled, could a task conflict with its speculative > attempt? > Could the two tasks concurrently write to the same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
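As background for the question being asked, the combination of settings involved looks roughly like this; the values are illustrative and not a recommendation from this thread.
{code:python}
from pyspark import SparkConf

# Speculation launches a duplicate of a slow task; the question above is whether
# that duplicate can race the original attempt on the same deterministic
# staging file name when partitionOverwriteMode is "dynamic".
conf = (SparkConf()
        .set("spark.speculation", "true")
        .set("spark.speculation.quantile", "0.75")    # fraction of tasks finished before speculating
        .set("spark.speculation.multiplier", "1.5")   # how much slower than the median a task must be
        .set("spark.sql.sources.partitionOverwriteMode", "dynamic"))
{code}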
[jira] [Commented] (SPARK-27194) Job failures when task attempts do not clean up spark-staging parquet files
[ https://issues.apache.org/jira/browse/SPARK-27194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150690#comment-17150690 ] Apache Spark commented on SPARK-27194: -- User 'turboFei' has created a pull request for this issue: https://github.com/apache/spark/pull/28989 > Job failures when task attempts do not clean up spark-staging parquet files > --- > > Key: SPARK-27194 > URL: https://issues.apache.org/jira/browse/SPARK-27194 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.1, 2.3.2, 2.3.3 >Reporter: Reza Safi >Priority: Major > > When a container fails for some reason (for example when killed by yarn for > exceeding memory limits), the subsequent task attempts for the tasks that > were running on that container all fail with a FileAlreadyExistsException. > The original task attempt does not seem to successfully call abortTask (or at > least its "best effort" delete is unsuccessful) and clean up the parquet file > it was writing to, so when later task attempts try to write to the same > spark-staging directory using the same file name, the job fails. > Here is what transpires in the logs: > The container where task 200.0 is running is killed and the task is lost: > {code} > 19/02/20 09:33:25 ERROR cluster.YarnClusterScheduler: Lost executor y on > t.y.z.com: Container killed by YARN for exceeding memory limits. 8.1 GB of 8 > GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. > 19/02/20 09:33:25 WARN scheduler.TaskSetManager: Lost task 200.0 in stage > 0.0 (TID xxx, t.y.z.com, executor 93): ExecutorLostFailure (executor 93 > exited caused by one of the running tasks) Reason: Container killed by YARN > for exceeding memory limits. 8.1 GB of 8 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > {code} > The task is re-attempted on a different executor and fails because the > part-00200-blah-blah.c000.snappy.parquet file from the first task attempt > already exists: > {code} > 19/02/20 09:35:01 WARN scheduler.TaskSetManager: Lost task 200.1 in stage 0.0 > (TID 594, tn.y.z.com, executor 70): org.apache.spark.SparkException: Task > failed while writing rows. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client a.b.c.d already exists > {code} > The job fails when the the configured task attempts (spark.task.maxFailures) > have failed with the same error: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 200 > in stage 0.0 failed 20 times, most recent failure: Lost task 284.19 in stage > 0.0 (TID yyy, tm.y.z.com, executor 16): org.apache.spark.SparkException: Task > failed while writing rows. > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285) > ... > Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: > /user/hive/warehouse/tmp_supply_feb1/.spark-staging-blah-blah-blah/dt=2019-02-17/part-00200-blah-blah.c000.snappy.parquet > for client i.p.a.d already exists > {code} > SPARK-26682 wasn't the root cause here, since there wasn't any stage > reattempt. > This issue seems to happen when > spark.sql.sources.partitionOverwriteMode=dynamic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
zhengruifeng created SPARK-32164: Summary: GeneralizedLinearRegressionSummary optimization Key: SPARK-32164 URL: https://issues.apache.org/jira/browse/SPARK-32164 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng Compute several statistics in a single pass. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
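The summary above is terse; as a generic illustration of the idea (computing several summary statistics in one scan of the predictions rather than one job per statistic), here is a PySpark sketch with made-up columns. The actual change targets the Scala GeneralizedLinearRegressionSummary internals and is not reproduced here.
{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("single-pass-stats").getOrCreate()

# Stand-in for a fitted model's prediction output: label, prediction, weight.
preds = spark.createDataFrame(
    [(1.0, 0.8, 1.0), (0.0, 0.3, 2.0), (1.0, 0.6, 1.0)],
    "label DOUBLE, prediction DOUBLE, weight DOUBLE")

resid = F.col("label") - F.col("prediction")

# One wide agg() computes every quantity in a single pass over `preds`,
# instead of triggering a separate Spark job per statistic.
stats = preds.agg(
    F.count(F.lit(1)).alias("n"),
    F.sum("weight").alias("weight_sum"),
    F.sum(F.col("weight") * F.col("label")).alias("weighted_label_sum"),
    F.sum(F.col("weight") * resid * resid).alias("weighted_squared_error"),
).first()

print(stats["weighted_squared_error"] / stats["weight_sum"])
{code}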
[jira] [Assigned] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32164: Assignee: (was: Apache Spark) > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150692#comment-17150692 ] Apache Spark commented on SPARK-32164: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/28990 > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32164) GeneralizedLinearRegressionSummary optimization
[ https://issues.apache.org/jira/browse/SPARK-32164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32164: Assignee: Apache Spark > GeneralizedLinearRegressionSummary optimization > --- > > Key: SPARK-32164 > URL: https://issues.apache.org/jira/browse/SPARK-32164 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor > > compute several statistics on single pass -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession
Xianjin YE created SPARK-32165: -- Summary: SessionState leaks SparkListener with multiple SparkSession Key: SPARK-32165 URL: https://issues.apache.org/jira/browse/SPARK-32165 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xianjin YE Copied from [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
{code:java}
test("SPARK-31354: SparkContext only register one SparkSession ApplicationEnd listener") {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("test-app-SPARK-31354-1")
  val context = new SparkContext(conf)

  SparkSession
    .builder()
    .sparkContext(context)
    .master("local")
    .getOrCreate()
    .sessionState // this touches the sessionState
  val postFirstCreation = context.listenerBus.listeners.size()
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()

  SparkSession
    .builder()
    .sparkContext(context)
    .master("local")
    .getOrCreate()
    .sessionState // this touches the sessionState
  val postSecondCreation = context.listenerBus.listeners.size()
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()

  assert(postFirstCreation == postSecondCreation)
}
{code}
The problem can be reproduced by the above code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32166) Metastore problem on Spark3.0 with Hive3.0
hzk created SPARK-32166: --- Summary: Metastore problem on Spark3.0 with Hive3.0 Key: SPARK-32166 URL: https://issues.apache.org/jira/browse/SPARK-32166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: hzk When I use spark-sql to create a table, the problem appears.
{code:java}
create table bigbig as
select
  b.user_id,
  b.name,
  b.age,
  c.address,
  c.city,
  a.position,
  a.object,
  a.problem,
  a.complaint_time
from (
  select user_id, position, object, problem, complaint_time
  from HIVE_COMBINE_7efde4e2dcb34c218b3fb08872e698d5
) as a
left join HIVE_ODS_17_TEST_DEMO_ODS_USERS_INFO_20200608141945 as b on b.user_id = a.user_id
left join HIVE_ODS_17_TEST_ADDRESS_CITY_20200608141942 as c on c.address_id = b.address_id;
{code}
It opened a connection to the Hive metastore. My Hive version is 3.1.0.
{code:java}
org.apache.thrift.TApplicationException: Required field 'filesAdded' is unset! Struct:InsertEventRequestData(filesAdded:null)
org.apache.thrift.TApplicationException: Required field 'filesAdded' is unset! Struct:InsertEventRequestData(filesAdded:null)
 at org.apache.thrift.TApplicationException.read(TApplicationException.java:111)
 at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_fire_listener_event(ThriftHiveMetastore.java:4182)
 at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.fire_listener_event(ThriftHiveMetastore.java:4169)
 at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.fireListenerEvent(HiveMetaStoreClient.java:1954)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)
 at com.sun.proxy.$Proxy5.fireListenerEvent(Unknown Source)
 at org.apache.hadoop.hive.ql.metadata.Hive.fireInsertEvent(Hive.java:1947)
 at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1673)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:847)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:757)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:757)
 at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:272)
 at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:210)
 at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:209)
 at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:255)
 at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:756)
 at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:829)
 at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
 at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:827)
 at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
 at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:827)
 at org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:416)
 at org.apache.spark.sql.execution.command.LoadDataCommand.run(tables.scala:403)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
 at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
 at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
 at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
 at org.apache.spark.sql.Dataset.(Dataset.scala:190)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
 at org.apache.spark.sql.S
[jira] [Resolved] (SPARK-25594) OOM in long running applications even with UI disabled
[ https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-25594. - Resolution: Won't Fix > OOM in long running applications even with UI disabled > -- > > Key: SPARK-25594 > URL: https://issues.apache.org/jira/browse/SPARK-25594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > > Typically, for long-running applications with a large number of tasks, it is > common to disable the UI to minimize overhead at the driver. > Earlier, with the Spark UI disabled, only stage/job information was kept as part > of JobProgressListener. > As part of the history server scalability fixes, particularly SPARK-20643, > task information continues to be maintained in memory in spite of the UI being disabled. > In our long-running tests against the Spark Thrift Server, this eventually > results in an OOM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25594) OOM in long running applications even with UI disabled
[ https://issues.apache.org/jira/browse/SPARK-25594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150729#comment-17150729 ] Mridul Muralidharan commented on SPARK-25594: - Given the regression in functionality if this is merged, closing the bug. See comment: https://github.com/apache/spark/pull/22609#issuecomment-426405757 > OOM in long running applications even with UI disabled > -- > > Key: SPARK-25594 > URL: https://issues.apache.org/jira/browse/SPARK-25594 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.0 >Reporter: Mridul Muralidharan >Assignee: Mridul Muralidharan >Priority: Major > > Typically, for long-running applications with a large number of tasks, it is > common to disable the UI to minimize overhead at the driver. > Earlier, with the Spark UI disabled, only stage/job information was kept as part > of JobProgressListener. > As part of the history server scalability fixes, particularly SPARK-20643, > task information continues to be maintained in memory in spite of the UI being disabled. > In our long-running tests against the Spark Thrift Server, this eventually > results in an OOM. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
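The ticket was closed as Won't Fix, so the sketch below is only context on the knobs involved: the UI-related retention settings a long-running driver (such as the Thrift Server scenario above) can tune. The defaults in the comments are from the Spark configuration docs; per this ticket, disabling the UI alone did not stop task data from being retained on the affected versions.
{code:python}
from pyspark import SparkConf

# Illustrative values for a long-running driver; these bound how much job,
# stage and task state the status store keeps, independent of whether the
# UI is actually served.
conf = (SparkConf()
        .set("spark.ui.enabled", "false")
        .set("spark.ui.retainedJobs", "200")      # default 1000
        .set("spark.ui.retainedStages", "200")    # default 1000
        .set("spark.ui.retainedTasks", "10000"))  # default 100000
{code}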