[jira] [Commented] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605335#comment-16605335 ] Apache Spark commented on SPARK-25313: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22346 > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In the following example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in the schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand are changed by RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid the names being replaced by the > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
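For readability, here is the reproduction from the description above as a standalone snippet, assuming a Spark 2.3/2.4 session named `spark` and a writable `/tmp/t` (the quoting around the location was added here so the statement parses):

{code:scala}
val location = "/tmp/t"
val df = spark.range(10).toDF("id")
df.write.format("parquet").saveAsTable("tbl")
spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet LOCATION '$location'")
spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
// Per the report, before the fix the schema printed here carries the column
// name `id` instead of `ID`, and the final query on tbl2 returns no rows.
println(spark.read.parquet(location).schema)
spark.table("tbl2").show()
{code}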
[jira] [Commented] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605317#comment-16605317 ] Apache Spark commented on SPARK-12321: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22345 > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605316#comment-16605316 ] Apache Spark commented on SPARK-12321: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/22345 > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605308#comment-16605308 ] Gengliang Wang commented on SPARK-24771: [~vanzin] I am OK with either way. Shading Avro 1.8 in data source only seems reasonable. But I am not confident enough to do the change. Can you open a PR for it? > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605263#comment-16605263 ] Apache Spark commented on SPARK-25352: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22344 > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25352: Assignee: Apache Spark > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > We have optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
[ https://issues.apache.org/jira/browse/SPARK-25352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25352: Assignee: (was: Apache Spark) > Perform ordered global limit when limit number is bigger than > topKSortFallbackThreshold > --- > > Key: SPARK-25352 > URL: https://issues.apache.org/jira/browse/SPARK-25352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We have optimization on global limit to evenly distribute limit rows across > all partitions. This optimization doesn't work for ordered results. > For a query ending with sort + limit, in most cases it is performed by > `TakeOrderedAndProjectExec`. > But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, > global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25352) Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold
Liang-Chi Hsieh created SPARK-25352: --- Summary: Perform ordered global limit when limit number is bigger than topKSortFallbackThreshold Key: SPARK-25352 URL: https://issues.apache.org/jira/browse/SPARK-25352 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh We have optimization on global limit to evenly distribute limit rows across all partitions. This optimization doesn't work for ordered results. For a query ending with sort + limit, in most cases it is performed by `TakeOrderedAndProjectExec`. But if limit number is bigger than `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD`, global limit will be used. At this moment, we need to do ordered global limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
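To make the trigger condition concrete, here is a hedged sketch; the SQL property behind `SQLConf.TOP_K_SORT_FALLBACK_THRESHOLD` is assumed to be `spark.sql.execution.topKSortFallbackThreshold` (Spark 2.4 naming):

{code:scala}
import org.apache.spark.sql.functions.col

// Force the fallback: make the threshold smaller than the LIMIT below.
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "10")

val q = spark.range(1000).orderBy(col("id").desc).limit(100)

// With limit (100) > threshold (10), the plan uses Sort + GlobalLimit instead
// of TakeOrderedAndProjectExec; this issue is about keeping that path ordered.
q.explain()
q.show(5)
{code}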
[jira] [Resolved] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25252. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 6 [https://github.com/apache/spark/pull/6] > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 2.4.0 > > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
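The same behaviour through the DataFrame API, as a small sketch for Spark 2.4+ where arrays are accepted as the root type of `to_json`:

{code:scala}
import org.apache.spark.sql.functions.{array, lit, to_json}

// Array of atomic values as the root type, mirroring the SQL above.
spark.range(1)
  .select(to_json(array(lit("1"), lit("2"), lit("3"))).as("json"))
  .show(false)   // expected: ["1","2","3"]

// Nested arrays as the root type.
spark.range(1)
  .select(to_json(array(array(lit(1), lit(2), lit(3)), array(lit(4)))).as("json"))
  .show(false)   // expected: [[1,2,3],[4]]
{code}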
[jira] [Assigned] (SPARK-25252) Support arrays of any types in to_json
[ https://issues.apache.org/jira/browse/SPARK-25252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25252: Assignee: Maxim Gekk > Support arrays of any types in to_json > -- > > Key: SPARK-25252 > URL: https://issues.apache.org/jira/browse/SPARK-25252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Need to improve the to_json function and make it more consistent with > from_json by supporting arrays of any types (as root types). For now, it > supports only arrays of structs and arrays of maps. After the changes the > following code should work: > {code:scala} > select to_json(array('1','2','3')) > > ["1","2","3"] > select to_json(array(array(1,2,3),array(4))) > > [[1,2,3],[4]] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25344) Break large tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-25344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605250#comment-16605250 ] Imran Rashid commented on SPARK-25344: -- kinda related, maybe this should get its own jira -- when you run the "pyspark-sql" tests, it also somehow runs {{SparkSubmitTests}}, which really should only be in the "pyspark-core" module. For me they take 80s; it would be nice to eliminate that. I don't really understand why they get run in that module, but it does seem that if I comment out the import in sql/tests.py, then they don't get run that extra time. We can't really do that, as the import is needed for the {{HiveSparkSubmitTests}}. But we should figure out why just importing it makes them run, and whether we can avoid that. > Break large tests.py files into smaller files > - > > Key: SPARK-25344 > URL: https://issues.apache.org/jira/browse/SPARK-25344 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Imran Rashid >Priority: Major > Labels: newbie > > We've got a ton of tests in one humongous tests.py file, rather than breaking > it out into smaller files. > Having one huge file doesn't seem great for code organization, and it also > makes the test parallelization in run-tests.py not work as well. On my > laptop, tests.py takes 150s, and the next longest test file takes only 20s. > There are similarly large files in other pyspark modules, eg. sql/tests.py, > ml/tests.py, mllib/tests.py, streaming/tests.py. > It seems that at least for some of these files, it's already broken into > independent test classes, so it shouldn't be too hard to just move them into > their own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25337: - Assignee: Dongjoon Hyun > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25337. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22340 [https://github.com/apache/spark/pull/22340] > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20918) Use FunctionIdentifier as function identifiers in FunctionRegistry
[ https://issues.apache.org/jira/browse/SPARK-20918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20918: Labels: release-notes (was: ) > Use FunctionIdentifier as function identifiers in FunctionRegistry > -- > > Key: SPARK-20918 > URL: https://issues.apache.org/jira/browse/SPARK-20918 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Labels: release-notes > Fix For: 2.3.0 > > > Currently, the unquoted string of a function identifier is being used as the > function identifier in the function registry. This could cause the incorrect > the behavior when users use `.` in the function names. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25313. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22320 [https://github.com/apache/spark/pull/22320] > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.4.0 > > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25313) Fix regression in FileFormatWriter output schema
[ https://issues.apache.org/jira/browse/SPARK-25313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25313: --- Assignee: Gengliang Wang > Fix regression in FileFormatWriter output schema > > > Key: SPARK-25313 > URL: https://issues.apache.org/jira/browse/SPARK-25313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > In the follow example: > val location = "/tmp/t" > val df = spark.range(10).toDF("id") > df.write.format("parquet").saveAsTable("tbl") > spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") > spark.sql(s"CREATE TABLE tbl2(ID long) USING parquet location > $location") > spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") > println(spark.read.parquet(location).schema) > spark.table("tbl2").show() > The output column name in schema will be id instead of ID, thus the last > query shows nothing from tbl2. > By enabling the debug message we can see that the output naming is changed > from ID to id, and then the outputColumns in > InsertIntoHadoopFsRelationCommand is changed in RemoveRedundantAliases. > To guarantee correctness, we should change the output columns from > `Seq[Attribute]` to `Seq[String]` to avoid its names being replaced by > optimizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605186#comment-16605186 ] Hyukjin Kwon commented on SPARK-18112: -- We need the metastore jar if I understood correctly. FWIW, I am seeing a few tests internally running with different metastore support. I doubt there's an issue with the supportability itself. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and Hive 2.1.0 have > also been out for a long time, but until now Spark only supports reading Hive > metastore data from Hive 1.2.1 and older versions. Since Hive 2.x has many > bug fixes and performance improvements, it's better and urgent to upgrade to > support Hive 2.x. > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
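For context on "we need the metastore jar": since the fix (Spark 2.2.0+), talking to a newer metastore is a configuration matter. A minimal, hedged sketch follows; the jar path is a placeholder you would point at a local Hive 2.x client installation:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2.x-metastore")
  // Version of the Hive metastore client Spark should instantiate.
  .config("spark.sql.hive.metastore.version", "2.1.1")
  // Classpath holding the matching Hive 2.x client jars (placeholder path).
  .config("spark.sql.hive.metastore.jars", "/opt/hive-2.1.1/lib/*")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
{code}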
[jira] [Commented] (SPARK-25346) Document Spark builtin data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605141#comment-16605141 ] Hyukjin Kwon commented on SPARK-25346: -- Avro - documentation was added in SPARK-25133. I agree there isn't explicit documentation that lists the builtin data sources; however, I wonder if it actually blocks SPARK-25347, since it can be added in other forms like the examples above. > Document Spark builtin data sources > --- > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site, so users know > what is available by default. However, I didn't find any in the 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25346) Document Spark builtin data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605140#comment-16605140 ] Hyukjin Kwon commented on SPARK-25346: -- [~mengxr], actually there is documentation for several data sources. For example, Parquet - https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files ORC - https://spark.apache.org/docs/latest/sql-programming-guide.html#orc-files JSON - https://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets CSV - https://spark.apache.org/docs/latest/sql-programming-guide.html#manually-specifying-options (there were a few attempts at CSV documentation, but they failed because of duplication with the API documentation in DataFrameReader) JDBC - https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases > Document Spark builtin data sources > --- > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site, so users know > what is available by default. However, I didn't find any in the 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
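For reference, this is roughly how the sources linked above surface through the DataFrame reader API; the paths, JDBC URL, and credentials below are placeholders:

{code:scala}
// Each format below ships with Spark; paths and the JDBC URL are placeholders.
val parquetDF = spark.read.format("parquet").load("/data/events.parquet")
val orcDF     = spark.read.format("orc").load("/data/events.orc")
val jsonDF    = spark.read.format("json").load("/data/events.json")
val csvDF     = spark.read.format("csv").option("header", "true").load("/data/events.csv")
val jdbcDF    = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/events")
  .option("dbtable", "public.events")
  .option("user", "spark")
  .option("password", "secret")
  .load()
{code}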
[jira] [Comment Edited] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605131#comment-16605131 ] Yuming Wang edited comment on SPARK-25330 at 9/6/18 1:09 AM: - I try to build Hadoop 2.7.7 with [{{Configuration.getRestrictParserDefault(Object resource)}}|https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236] = true and false. It succeeded when {{Configuration.getRestrictParserDefault(Object resource)=false}}, but failed when {{Configuration.getRestrictParserDefault(Object resource)=true}}. was (Author: q79969786): I try to build Hadoop 2.7.7 with[{{Configuration.getRestrictParserDefault(Object resource)}}|https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236] = true and false. It succeeded when {{Configuration.getRestrictParserDefault(Object resource)=false}}, but failed when {{Configuration.getRestrictParserDefault(Object resource)=true}}. > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605131#comment-16605131 ] Yuming Wang commented on SPARK-25330: - I try to build Hadoop 2.7.7 with[{{Configuration.getRestrictParserDefault(Object resource)}}|https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236] = true and false. It succeeded when {{Configuration.getRestrictParserDefault(Object resource)=false}}, but failed when {{Configuration.getRestrictParserDefault(Object resource)=true}}. > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605129#comment-16605129 ] Yuming Wang commented on SPARK-25330: - No. The issue occurred in this commit: [apache/hadoop@{{feb886f}}|https://github.com/apache/hadoop/commit/feb886f2093ea5da0cd09c69bd1360a335335c86]. > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25330) Permission issue after upgrade hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605114#comment-16605114 ] Eric Yang commented on SPARK-25330: --- [~yumwang] Does Hadoop 2.7.5 work? It might help us isolate the release that started the regression and narrow down the number of JIRAs that the Hadoop team needs to go through. Thanks > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception
[ https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-25268: - Assignee: shahid > runParallelPersonalizedPageRank throws serialization Exception > -- > > Key: SPARK-25268 > URL: https://issues.apache.org/jira/browse/SPARK-25268 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Assignee: shahid >Priority: Critical > > A recent change to PageRank introduced a bug in the > ParallelPersonalizedPageRank implementation. The change prevents > serialization of a Map which needs to be broadcast to all workers. The issue > is in this line here: > [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201] > Because graphx units tests are run in local mode, the Serialization issue is > not caught. > > {code:java} > [info] - Star example parallel personalized PageRank *** FAILED *** (2 > seconds, 160 milliseconds) > [info] java.io.NotSerializableException: > scala.collection.immutable.MapLike$$anon$2 > [info] Serialization stack: > [info] - object not serializable (class: > scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> > SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> > SparseVector(3)((2,1.0 > [info] at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > [info] at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88) > [info] at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > [info] at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489) > [info] at > org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205) > [info] at > org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at 
org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > [info] at scala.collection.immutable.List.foreach(List.scala:383)
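The `scala.collection.immutable.MapLike$$anon$2` in the trace above is the lazy view returned by `Map.mapValues` in Scala 2.11/2.12, which is not `Serializable` and therefore cannot be broadcast. A small standalone sketch of that failure mode and the usual workaround follows; it illustrates the Scala behaviour, not necessarily the exact patch applied to PageRank:

{code:scala}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

def javaSerializable(o: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o); true }
  catch { case _: NotSerializableException => false }

val base = Map(1L -> 1.0, 2L -> 2.0, 3L -> 3.0)

// mapValues returns a lazy, non-serializable view (MapLike$$anon$2 in the trace).
val view = base.mapValues(_ * 2)
// Materializing the view yields a plain immutable Map that serializes fine.
val materialized = base.mapValues(_ * 2).map(identity)

println(javaSerializable(view))         // false
println(javaSerializable(materialized)) // true
{code}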
[jira] [Updated] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-20901: -- Affects Version/s: 2.4.0 > Feature parity for ORC with Parquet > --- > > Key: SPARK-20901 > URL: https://issues.apache.org/jira/browse/SPARK-20901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.2.1, 2.3.0, 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to track the feature parity for ORC with Parquet. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23774) `Cast` to CHAR/VARCHAR should truncate the values
[ https://issues.apache.org/jira/browse/SPARK-23774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-23774. --- Resolution: Won't Do Per review comments, we will revisit this when we can support CHAR/VARCHAR natively. > `Cast` to CHAR/VARCHAR should truncate the values > - > > Key: SPARK-23774 > URL: https://issues.apache.org/jira/browse/SPARK-23774 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to fix the following `CAST` behavior on `CHAR/VARCHAR` types. > Since HiveStringType is used only in parsing, this PR is also about parsing. > *Spark* > {code} > scala> sql("SELECT CAST('123' AS CHAR(1)), CAST('123' AS VARCHAR(1))").show > +---+---+ > |CAST(123 AS STRING)|CAST(123 AS STRING)| > +---+---+ > |123|123| > +---+---+ > scala> sql("SELECT CAST('123' AS CHAR(0)), CAST('123' AS VARCHAR(0))").show > +---+---+ > |CAST(123 AS STRING)|CAST(123 AS STRING)| > +---+---+ > |123|123| > +---+---+ > {code} > *Hive* > {code} > hive> SELECT CAST('123' AS CHAR(1)), CAST('123' AS VARCHAR(1)); > OK > 1 1 > hive> SELECT CAST('123' AS CHAR(0)), CAST('123' AS VARCHAR(0)); > FAILED: RuntimeException Char length 0 out of allowed range [1, 255] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23131) Kryo raises StackOverflow during serializing GLR model
[ https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23131: - Assignee: Yuming Wang > Kryo raises StackOverflow during serializing GLR model > -- > > Key: SPARK-23131 > URL: https://issues.apache.org/jira/browse/SPARK-23131 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.2.0 >Reporter: Peigen >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.4.0 > > > When trying to use GeneralizedLinearRegression model and set SparkConf to use > KryoSerializer(JavaSerializer is fine) > It causes StackOverflowException > {quote}Exception in thread "dispatcher-event-loop-34" > java.lang.StackOverflowError > at java.util.HashMap.hash(HashMap.java:338) > at java.util.HashMap.get(HashMap.java:556) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > {quote} > This is very likely to be > [https://github.com/EsotericSoftware/kryo/issues/341] > Upgrade Kryo to 4.0+ probably could fix this > > Wish for upgrade Kryo version for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25176: - Assignee: Yuming Wang > Kryo fails to serialize a parametrised type hierarchy > - > > Key: SPARK-25176 > URL: https://issues.apache.org/jira/browse/SPARK-25176 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.1 >Reporter: Mikhail Pryakhin >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > I'm using the latest spark version spark-core_2.11:2.3.1 which > transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the > com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo > serializer contains an issue [1,2] which results in throwing > ClassCastExceptions when serialising parameterised type hierarchy. > This issue has been fixed in kryo version 4.0.0 [3]. It would be great to > have this update in Spark as well. Could you please upgrade the version of > com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? > You can find a simple test to reproduce the issue [4]. > [1] https://github.com/EsotericSoftware/kryo/issues/384 > [2] https://github.com/EsotericSoftware/kryo/issues/377 > [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 > [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25258: - Assignee: Yuming Wang > Upgrade kryo package to version 4.0.2 > - > > Key: SPARK-25258 > URL: https://issues.apache.org/jira/browse/SPARK-25258 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Assignee: Yuming Wang >Priority: Major > Fix For: 2.4.0 > > > Recently, we encountered a Kryo performance issue in Spark 2.1.0, and the > issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions might > encounter this issue. > Issue description: > In the shuffle write phase or some spilling operations, Spark will use the Kryo > serializer to serialize data if `spark.serializer` is set to > `KryoSerializer`. However, when the data contains some extremely large records, > KryoSerializer's MapReferenceResolver would expand, and its `reset` > method will take a long time to reset all items in the writtenObjects table to > null. > com.esotericsoftware.kryo.util.MapReferenceResolver > {code:java} > public void reset () { > readObjects.clear(); > writtenObjects.clear(); > } > public void clear () { > K[] keyTable = this.keyTable; > for (int i = capacity + stashSize; i-- > 0;) > keyTable[i] = null; > size = 0; > stashSize = 0; > } > {code} > I checked the Kryo project on GitHub, and this issue seems to be fixed in 4.0.2+ > [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] > > I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix > this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25176. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22179 [https://github.com/apache/spark/pull/22179] > Kryo fails to serialize a parametrised type hierarchy > - > > Key: SPARK-25176 > URL: https://issues.apache.org/jira/browse/SPARK-25176 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.1 >Reporter: Mikhail Pryakhin >Priority: Major > Fix For: 2.4.0 > > > I'm using the latest spark version spark-core_2.11:2.3.1 which > transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the > com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo > serializer contains an issue [1,2] which results in throwing > ClassCastExceptions when serialising parameterised type hierarchy. > This issue has been fixed in kryo version 4.0.0 [3]. It would be great to > have this update in Spark as well. Could you please upgrade the version of > com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? > You can find a simple test to reproduce the issue [4]. > [1] https://github.com/EsotericSoftware/kryo/issues/384 > [2] https://github.com/EsotericSoftware/kryo/issues/377 > [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 > [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23131) Kryo raises StackOverflow during serializing GLR model
[ https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23131. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22179 [https://github.com/apache/spark/pull/22179] > Kryo raises StackOverflow during serializing GLR model > -- > > Key: SPARK-23131 > URL: https://issues.apache.org/jira/browse/SPARK-23131 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.2.0 >Reporter: Peigen >Priority: Minor > Fix For: 2.4.0 > > > When trying to use GeneralizedLinearRegression model and set SparkConf to use > KryoSerializer(JavaSerializer is fine) > It causes StackOverflowException > {quote}Exception in thread "dispatcher-event-loop-34" > java.lang.StackOverflowError > at java.util.HashMap.hash(HashMap.java:338) > at java.util.HashMap.get(HashMap.java:556) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > {quote} > This is very likely to be > [https://github.com/EsotericSoftware/kryo/issues/341] > Upgrade Kryo to 4.0+ probably could fix this > > Wish for upgrade Kryo version for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25258. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22179 [https://github.com/apache/spark/pull/22179] > Upgrade kryo package to version 4.0.2 > - > > Key: SPARK-25258 > URL: https://issues.apache.org/jira/browse/SPARK-25258 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > Fix For: 2.4.0 > > > Recently, we encountered a Kryo performance issue in Spark 2.1.0, and the > issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions might > encounter this issue. > Issue description: > In the shuffle write phase or some spilling operations, Spark will use the Kryo > serializer to serialize data if `spark.serializer` is set to > `KryoSerializer`. However, when the data contains some extremely large records, > KryoSerializer's MapReferenceResolver would expand, and its `reset` > method will take a long time to reset all items in the writtenObjects table to > null. > com.esotericsoftware.kryo.util.MapReferenceResolver > {code:java} > public void reset () { > readObjects.clear(); > writtenObjects.clear(); > } > public void clear () { > K[] keyTable = this.keyTable; > for (int i = capacity + stashSize; i-- > 0;) > keyTable[i] = null; > size = 0; > stashSize = 0; > } > {code} > I checked the Kryo project on GitHub, and this issue seems to be fixed in 4.0.2+ > [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] > > I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix > this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
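For context, the code path described in the issue above is only taken when Kryo is the configured serializer. A minimal, hedged sketch of that setup; the app name, registered classes, and job are just examples:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setAppName("kryo-shuffle")
  // Kryo only sits on the shuffle/spill serialization path when selected here.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: registering classes keeps serialized records smaller.
  .registerKryoClasses(Array(classOf[Array[Byte]], classOf[Array[Long]]))

val spark = SparkSession.builder().config(conf).getOrCreate()

// Any wide operation on this session now serializes records with Kryo, which
// is where the MapReferenceResolver reset cost shows up for very large records.
spark.sparkContext
  .parallelize(1 to 1000000)
  .map(i => (i % 10, i.toLong))
  .reduceByKey(_ + _)
  .count()
{code}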
[jira] [Resolved] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25335. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22333 [https://github.com/apache/spark/pull/22333] > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.4.0 > > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once. However, it occurs too many times in > build systems. This issue aims to skip Zinc downloading when the system > already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25335: - Assignee: Dongjoon Hyun > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once. However, it occurs too many times in > build systems. This issue aims to skip Zinc downloading when the system > already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23243) Shuffle+Repartition on an RDD could lead to incorrect answers
[ https://issues.apache.org/jira/browse/SPARK-23243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23243. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.4.0 > Shuffle+Repartition on an RDD could lead to incorrect answers > - > > Key: SPARK-23243 > URL: https://issues.apache.org/jira/browse/SPARK-23243 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: Jiang Xingbo >Assignee: Wenchen Fan >Priority: Blocker > Labels: correctness > Fix For: 2.4.0 > > > The RDD repartition also uses the round-robin way to distribute data, this > can also cause incorrect answers on RDD workload the similar way as in > https://issues.apache.org/jira/browse/SPARK-23207 > The approach that fixes DataFrame.repartition() doesn't apply on the RDD > repartition issue, as discussed in > https://github.com/apache/spark/pull/20393#issuecomment-360912451 > We track for alternative solutions for this issue in this task. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
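For illustration only, a sketch of the pattern this issue is about: a shuffle followed by RDD.repartition, where the round-robin redistribution can place records differently across task attempts after a fetch failure. The numbers are placeholders.
{code:scala}
// Assumes an existing SparkContext `sc`
val counts = sc.parallelize(1 to 1000000)
  .map(x => (x % 100, 1))
  .reduceByKey(_ + _)   // shuffle map stage
  .repartition(50)      // round-robin redistribution; the step at risk when tasks are retried
counts.count()
{code}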
[jira] [Updated] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception
[ https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-25268: -- Shepherd: Joseph K. Bradley > runParallelPersonalizedPageRank throws serialization Exception > -- > > Key: SPARK-25268 > URL: https://issues.apache.org/jira/browse/SPARK-25268 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Critical > > A recent change to PageRank introduced a bug in the > ParallelPersonalizedPageRank implementation. The change prevents > serialization of a Map which needs to be broadcast to all workers. The issue > is in this line here: > [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201] > Because graphx units tests are run in local mode, the Serialization issue is > not caught. > > {code:java} > [info] - Star example parallel personalized PageRank *** FAILED *** (2 > seconds, 160 milliseconds) > [info] java.io.NotSerializableException: > scala.collection.immutable.MapLike$$anon$2 > [info] Serialization stack: > [info] - object not serializable (class: > scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> > SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> > SparseVector(3)((2,1.0 > [info] at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) > [info] at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) > [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348) > [info] at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127) > [info] at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88) > [info] at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > [info] at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) > [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489) > [info] at > org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205) > [info] at > org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) > [info] at > org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) > [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at 
org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) > [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40) > [info] at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) > [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > [info] at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > [info] at scala.collection.immutable.List.foreach(List.scala:383) > [info] at org.scalatest
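The non-serializable class in the stack trace, scala.collection.immutable.MapLike$$anon$2, is what Map.mapValues returns: a lazy, non-serializable view. A small illustration of that Scala behaviour (this is not the Spark source itself, and the vector construction is an assumption):
{code:scala}
import org.apache.spark.ml.linalg.Vectors

val sources = Seq(1L, 2L, 3L)

// mapValues produces a view of type MapLike$$anon$2, which is not Serializable,
// so broadcasting it fails exactly as in the test output above.
val lazyView = sources.zipWithIndex.toMap
  .mapValues { i => Vectors.sparse(sources.size, Seq((i, 1.0))) }

// Forcing a strict Map is the usual workaround before broadcasting.
val strict = lazyView.map(identity)
{code}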
[jira] [Resolved] (SPARK-25231) Running a Large Job with Speculation On Causes Executor Heartbeats to Time Out on Driver
[ https://issues.apache.org/jira/browse/SPARK-25231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-25231. --- Resolution: Fixed Assignee: Parth Gandhi Fix Version/s: 2.4.0 2.3.2 > Running a Large Job with Speculation On Causes Executor Heartbeats to Time > Out on Driver > > > Key: SPARK-25231 > URL: https://issues.apache.org/jira/browse/SPARK-25231 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 2.3.1 >Reporter: Parth Gandhi >Assignee: Parth Gandhi >Priority: Major > Fix For: 2.3.2, 2.4.0 > > > Running a large Spark job with speculation turned on was causing executor > heartbeats to time out on the driver end after sometime and eventually, after > hitting the max number of executor failures, the job would fail. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
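For reference, a sketch of the configuration knobs involved in the report above; the values shown are the usual defaults, not settings taken from the reporter's job.
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")                 // the feature implicated in the report
  .set("spark.executor.heartbeatInterval", "10s")   // how often executors heartbeat to the driver
  .set("spark.network.timeout", "120s")             // timeout after which the driver treats heartbeats as lost
{code}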
[jira] [Created] (SPARK-25351) Handle Pandas category type when converting from Python with Arrow
Bryan Cutler created SPARK-25351: Summary: Handle Pandas category type when converting from Python with Arrow Key: SPARK-25351 URL: https://issues.apache.org/jira/browse/SPARK-25351 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.1 Reporter: Bryan Cutler There needs to be some handling of category types done when calling {{createDataFrame}} with Arrow or the return value of {{pandas_udf}}. Without Arrow, Spark casts each element to the category. For example {noformat} In [1]: import pandas as pd In [2]: pdf = pd.DataFrame({"A":[u"a",u"b",u"c",u"a"]}) In [3]: pdf["B"] = pdf["A"].astype('category') In [4]: pdf Out[4]: A B 0 a a 1 b b 2 c c 3 a a In [5]: pdf.dtypes Out[5]: A object Bcategory dtype: object In [7]: spark.conf.set("spark.sql.execution.arrow.enabled", False) In [8]: df = spark.createDataFrame(pdf) In [9]: df.show() +---+---+ | A| B| +---+---+ | a| a| | b| b| | c| c| | a| a| +---+---+ In [10]: df.printSchema() root |-- A: string (nullable = true) |-- B: string (nullable = true) In [18]: spark.conf.set("spark.sql.execution.arrow.enabled", True) In [19]: df = spark.createDataFrame(pdf) 1667 spark_type = ArrayType(from_arrow_type(at.value_type)) 1668 else: -> 1669 raise TypeError("Unsupported type in conversion from Arrow: " + str(at)) 1670 return spark_type 1671 TypeError: Unsupported type in conversion from Arrow: dictionary {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21187: - Description: This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. ' Remaining types: * -*Date*- * -*Timestamp*- * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map * -*Decimal*- * -*Binary*- * Categorical when converting from Pandas Some things to do before closing this out: * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write values as BigDecimal)- * -Need to add some user docs- * -Make sure Python tests are thorough- * Check into complex type support mentioned in comments by [~leif], should we support mulit-indexing? was: This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. ' Remaining types: * -*Date*- * -*Timestamp*- * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map * -*Decimal*- * -*Binary*- Some things to do before closing this out: * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write values as BigDecimal)- * -Need to add some user docs- * -Make sure Python tests are thorough- * Check into complex type support mentioned in comments by [~leif], should we support mulit-indexing? > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Major > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. ' > Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, -Array-, Arrays of Date/Timestamps, Map > * -*Decimal*- > * -*Binary*- > * Categorical when converting from Pandas > Some things to do before closing this out: > * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal)- > * -Need to add some user docs- > * -Make sure Python tests are thorough- > * Check into complex type support mentioned in comments by [~leif], should > we support mulit-indexing? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on zero-size ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604926#comment-16604926 ] Shirish Tatikonda commented on SPARK-19809: --- Thank you [~dongjoon] > NullPointerException on zero-size ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.1, 2.2.1 >Reporter: Michał Dawid >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > Attachments: image-2018-02-26-20-29-49-410.png, > spark.sql.hive.convertMetastoreOrc.txt > > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) >
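As a hedged aside, the config referenced by the attachment name, spark.sql.hive.convertMetastoreOrc, controls whether Spark reads metastore ORC tables through its own data source path instead of the Hive OrcInputFormat seen in the stack trace; whether toggling it avoids the NPE on a given version is an assumption, not something stated in the issue.
{code:scala}
// Assumes an existing SparkSession `spark`; "some_orc_table" is a placeholder name.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.table("some_orc_table").show()
{code}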
[jira] [Commented] (SPARK-24771) Upgrade AVRO version from 1.7.7 to 1.8
[ https://issues.apache.org/jira/browse/SPARK-24771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604877#comment-16604877 ] Marcelo Vanzin commented on SPARK-24771: I ran a couple of our tests that exercise avro and they worked fine with 2.4. They're not comprehensive, though: - one uses the data source to read / write data, and that shouldn't really be affected by the change - the other uses {{GenericRecord}}, so it doesn't really use generated Avro types. So I don't really have a test that can say for sure what will break when you use generated types, which is the part that is explicitly called as being changed in 1.8. I still think it would be good to try to shade Avro 1.8 in the data source, and not expose it to other parts of Spark, but otherwise a strongly worded release note might be ok, although not optimal. > Upgrade AVRO version from 1.7.7 to 1.8 > -- > > Key: SPARK-24771 > URL: https://issues.apache.org/jira/browse/SPARK-24771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: release-notes > Fix For: 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25350) Spark Serving
[ https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604842#comment-16604842 ] Mark Hamilton commented on SPARK-25350: --- Hey, [~rxin], we had talked about this contribution at this past Spark + AI summit and I was wondering if you could at mention someone from your team who would like to check it out and give comments. Thanks so much for the help! > Spark Serving > - > > Key: SPARK-25350 > URL: https://issues.apache.org/jira/browse/SPARK-25350 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mark Hamilton >Priority: Major > Labels: features > > Microsoft has created a new system to turn Structured Streaming jobs into > RESTful web services. We would like to commit this work back to the > community. > More information can be found at the [ MMLSpark > website|[http://www.aka.ms/spark]] > And the [ Spark Serving Documentation > page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]] > > The code can be found in the MMLSpark Repo and a PR will be made soon: > [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] > > Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25350) Spark Serving
[ https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamilton updated SPARK-25350: -- Description: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website|[http://www.aka.ms/spark]] And the [Spark Serving Documentation page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! was: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website | [http://www.aka.ms/spark]] And the [Spark Serving Documentation page | [https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! > Spark Serving > - > > Key: SPARK-25350 > URL: https://issues.apache.org/jira/browse/SPARK-25350 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mark Hamilton >Priority: Major > Labels: features > > Microsoft has created a new system to turn Structured Streaming jobs into > RESTful web services. We would like to commit this work back to the > community. > More information can be found at the [MMLSpark > website|[http://www.aka.ms/spark]] > And the [Spark Serving Documentation > page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] > ] > > The code can be found in the MMLSpark Repo and a PR will be made soon: > [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] > > Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25350) Spark Serving
[ https://issues.apache.org/jira/browse/SPARK-25350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamilton updated SPARK-25350: -- Description: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [ MMLSpark website|[http://www.aka.ms/spark]] And the [ Spark Serving Documentation page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! was: Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website|[http://www.aka.ms/spark]] And the [Spark Serving Documentation page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! > Spark Serving > - > > Key: SPARK-25350 > URL: https://issues.apache.org/jira/browse/SPARK-25350 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mark Hamilton >Priority: Major > Labels: features > > Microsoft has created a new system to turn Structured Streaming jobs into > RESTful web services. We would like to commit this work back to the > community. > More information can be found at the [ MMLSpark > website|[http://www.aka.ms/spark]] > And the [ Spark Serving Documentation > page|[https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md]] > > The code can be found in the MMLSpark Repo and a PR will be made soon: > [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] > > Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25350) Spark Serving
Mark Hamilton created SPARK-25350: - Summary: Spark Serving Key: SPARK-25350 URL: https://issues.apache.org/jira/browse/SPARK-25350 Project: Spark Issue Type: New Feature Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Mark Hamilton Microsoft has created a new system to turn Structured Streaming jobs into RESTful web services. We would like to commit this work back to the community. More information can be found at the [MMLSpark website | [http://www.aka.ms/spark]] And the [Spark Serving Documentation page | [https://github.com/Azure/mmlspark/blob/master/docs/mmlspark-serving.md] ] The code can be found in the MMLSpark Repo and a PR will be made soon: [https://github.com/Azure/mmlspark/blob/master/src/io/http/src/main/scala/HTTPSource.scala] Thanks for your help and feedback! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25347) Document image data source in doc site
[ https://issues.apache.org/jira/browse/SPARK-25347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25347: -- Summary: Document image data source in doc site (was: Document image data sources in doc site) > Document image data source in doc site > -- > > Key: SPARK-25347 > URL: https://issues.apache.org/jira/browse/SPARK-25347 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > Currently, we only have Scala/Java API docs for image data source. It would > be nice to have some documentation in the doc site. So Python/R users can > also discover this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25345) Deprecate public APIs from ImageSchema
[ https://issues.apache.org/jira/browse/SPARK-25345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25345: -- Description: After SPARK-22328, we can deprecate the public APIs in ImageSchema (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images w/ Spark. (was: After SPARK-22328, we can deprecate the public APIs in ImageSchema and remove them in Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images w/ Spark.) > Deprecate public APIs from ImageSchema > -- > > Key: SPARK-25345 > URL: https://issues.apache.org/jira/browse/SPARK-25345 > Project: Spark > Issue Type: Story > Components: ML >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > After SPARK-22328, we can deprecate the public APIs in ImageSchema > (Scala/Python) and remove them in Spark 3.0 (TODO: create JIRA). So users get > a unified approach to load images w/ Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25349) Support sample pushdown in Data Source V2
Xiangrui Meng created SPARK-25349: - Summary: Support sample pushdown in Data Source V2 Key: SPARK-25349 URL: https://issues.apache.org/jira/browse/SPARK-25349 Project: Spark Issue Type: Story Components: SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng Supporting sample pushdown would help file-based data source implementations save I/O cost significantly if they can decide whether to read a file or not. cc: [~cloud_fan] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
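A quick sketch of why pushdown matters here: today a sample applied after the scan still reads every file, whereas a pushed-down sample could let the source skip whole files. The path and fraction below are placeholders.
{code:scala}
// Assumes an existing SparkSession `spark`.
// Without pushdown, sample() filters rows only after they have been read from every file.
val sampled = spark.read.parquet("/path/to/table")
  .sample(withReplacement = false, fraction = 0.01)
sampled.count()
{code}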
[jira] [Created] (SPARK-25348) Data source for binary files
Xiangrui Meng created SPARK-25348: - Summary: Data source for binary files Key: SPARK-25348 URL: https://issues.apache.org/jira/browse/SPARK-25348 Project: Spark Issue Type: Story Components: ML, SQL Affects Versions: 3.0.0 Reporter: Xiangrui Meng It would be useful to have a data source implementation for binary files, which can be used to build features to load images, audio, and videos. Microsoft has an implementation at [https://github.com/Azure/mmlspark/tree/master/src/io/binary.] It would be great if we can merge it into Spark main repo. cc: [~mhamilton] and [~imatiach] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25347) Document image data sources in doc site
Xiangrui Meng created SPARK-25347: - Summary: Document image data sources in doc site Key: SPARK-25347 URL: https://issues.apache.org/jira/browse/SPARK-25347 Project: Spark Issue Type: Story Components: Documentation Affects Versions: 2.4.0 Reporter: Xiangrui Meng Currently, we only have Scala/Java API docs for image data source. It would be nice to have some documentation in the doc site. So Python/R users can also discover this feature. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25346) Document Spark builtin data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25346: -- Summary: Document Spark builtin data sources (was: Document Spark built-in data sources) > Document Spark builtin data sources > --- > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site. So users know > what are available by default. However, I didn't find any from 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25346) Document Spark built-in data sources
[ https://issues.apache.org/jira/browse/SPARK-25346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-25346: -- Summary: Document Spark built-in data sources (was: Document Spark buit-in data sources) > Document Spark built-in data sources > > > Key: SPARK-25346 > URL: https://issues.apache.org/jira/browse/SPARK-25346 > Project: Spark > Issue Type: Story > Components: Documentation >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Priority: Major > > It would be nice to list built-in data sources in the doc site. So users know > what are available by default. However, I didn't find any from 2.3.1 docs. > > cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25346) Document Spark buit-in data sources
Xiangrui Meng created SPARK-25346: - Summary: Document Spark buit-in data sources Key: SPARK-25346 URL: https://issues.apache.org/jira/browse/SPARK-25346 Project: Spark Issue Type: Story Components: Documentation Affects Versions: 2.4.0 Reporter: Xiangrui Meng It would be nice to list built-in data sources in the doc site. So users know what are available by default. However, I didn't find any from 2.3.1 docs. cc: [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25345) Deprecate public APIs from ImageSchema
Xiangrui Meng created SPARK-25345: - Summary: Deprecate public APIs from ImageSchema Key: SPARK-25345 URL: https://issues.apache.org/jira/browse/SPARK-25345 Project: Spark Issue Type: Story Components: ML Affects Versions: 2.4.0 Reporter: Xiangrui Meng After SPARK-22328, we can deprecate the public APIs in ImageSchema and remove them in Spark 3.0 (TODO: create JIRA). So users get a unified approach to load images w/ Spark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-22666. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22328 [https://github.com/apache/spark/pull/22328] > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
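A minimal usage sketch of the reader interface this issue adds, alongside the older ImageSchema API it supersedes (see SPARK-25345 above for its deprecation); the path is a placeholder.
{code:scala}
import org.apache.spark.ml.image.ImageSchema

// Older standalone API
val viaImageSchema = ImageSchema.readImages("/path/to/images")

// Unified data source API added by this issue
val viaSource = spark.read.format("image").load("/path/to/images")
viaSource.printSchema()   // image struct: origin, height, width, nChannels, mode, data
{code}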
[jira] [Created] (SPARK-25344) Break large tests.py files into smaller files
Imran Rashid created SPARK-25344: Summary: Break large tests.py files into smaller files Key: SPARK-25344 URL: https://issues.apache.org/jira/browse/SPARK-25344 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.4.0 Reporter: Imran Rashid We've got a ton of tests in one humongous tests.py file, rather than breaking it out into smaller files. Having one huge file doesn't seem great for code organization, and it also makes the test parallelization in run-tests.py not work as well. On my laptop, tests.py takes 150s, and the next longest test file takes only 20s. There are similarly large files in other pyspark modules, e.g. sql/tests.py, ml/tests.py, mllib/tests.py, streaming/tests.py. It seems that at least for some of these files, it's already broken into independent test classes, so it shouldn't be too hard to just move them into their own files. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24360) Support Hive 3.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604748#comment-16604748 ] Dongjoon Hyun commented on SPARK-24360: --- [~toopt4]. Yep. We should support Hive 3.1 in this JIRA. > Support Hive 3.0 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24360) Support Hive 3.1 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24360: -- Summary: Support Hive 3.1 metastore (was: Support Hive 3.0 metastore) > Support Hive 3.1 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
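For context, the metastore version is selected through configuration; a sketch of what using the requested version might look like once supported ("3.1" is the value this JIRA asks for, not one guaranteed to work today):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "3.1")  // version this JIRA asks to support
  .config("spark.sql.hive.metastore.jars", "maven")   // fetch matching metastore client jars
  .enableHiveSupport()
  .getOrCreate()
{code}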
[jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
[ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Kemmer updated SPARK-25343: - Description: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string at the separators. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think the CSV parser is very close to it. was: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string at the separators. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. > Extend CSV parsing to Dataset[List[String]] > --- > > Key: SPARK-25343 > URL: https://issues.apache.org/jira/browse/SPARK-25343 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Kemmer >Priority: Minor > > With the cvs() method it is currenty possible to create a Dataframe from > Dataset[String], where the given string contains comma separated values. This > is really great. > But very often we have to parse files where we have to split the values of a > line by very individual value separators and regular expressions. The result > is a Dataset[List[String]]. This list corresponds to what you would get, > after splitting the values of a CSV string at the separators. > It would be great, if the csv() method would also accept such a Dataset as > input especially given a target schema. The csv parser usually casts the > separated values against the schema and can sort out lines where the values > of the columns do not fit with the schema. > This is especially interesting with PERMISSIVE mode and a column for corrupt > records which then should contain the input list of strings as a dumped JSON > string. > This is the functionality I am looking for and I think the CSV parser is very > close to it. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
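A hedged workaround sketch for the request above: re-join the already-split tokens with a delimiter that cannot appear in the data, then hand the result to the existing csv() reader together with the target schema, so PERMISSIVE mode and the corrupt-record column still apply. The schema, delimiter, and sample rows below are assumptions for illustration.
{code:scala}
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types._

// Assumes an existing SparkSession `spark`
import spark.implicits._

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType),
  StructField("score", DoubleType)))

// Tokens as they would come out of a custom splitter
val tokens: Dataset[List[String]] = Seq(
  List("1", "alice", "0.5"),
  List("2", "bob", "oops")   // should land in the corrupt-record column
).toDS()

val sep = "\u0001"
val parsed = spark.read
  .schema(schema.add("_corrupt_record", StringType))
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .option("sep", sep)
  .csv(tokens.map(_.mkString(sep)))
{code}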
[jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
[ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Kemmer updated SPARK-25343: - Description: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string at the separators. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. was: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. > Extend CSV parsing to Dataset[List[String]] > --- > > Key: SPARK-25343 > URL: https://issues.apache.org/jira/browse/SPARK-25343 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Kemmer >Priority: Minor > > With the cvs() method it is currenty possible to create a Dataframe from > Dataset[String], where the given string contains comma separated values. This > is really great. > But very often we have to parse files where we have to split the values of a > line by very individual value separators and regular expressions. The result > is a Dataset[List[String]]. This list corresponds to what you would get, > after splitting the values of a CSV string at the separators. > It would be great, if the csv() method would also accept such a Dataset as > input especially given a target schema. The csv parser usually casts the > separated values against the schema and can sort out lines where the values > of the columns do not fit with the schema. > This is especially interesting with PERMISSIVE mode and a column for corrupt > records which then should contain the input list of strings as a dumped JSON > string. > This is the functionality I am looking for and I think it is already > implemented in the CSV parser. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
[ https://issues.apache.org/jira/browse/SPARK-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Kemmer updated SPARK-25343: - Description: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is especially interesting with PERMISSIVE mode and a column for corrupt records which then should contain the input list of strings as a dumped JSON string. This is the functionality I am looking for and I think it is already implemented in the CSV parser. was: With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is the functionality I am looking for and I think it is already implemented in the CSV parser. > Extend CSV parsing to Dataset[List[String]] > --- > > Key: SPARK-25343 > URL: https://issues.apache.org/jira/browse/SPARK-25343 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Frank Kemmer >Priority: Minor > > With the cvs() method it is currenty possible to create a Dataframe from > Dataset[String], where the given string contains comma separated values. This > is really great. > But very often we have to parse files where we have to split the values of a > line by very individual value separators and regular expressions. The result > is a Dataset[List[String]]. This list corresponds to what you would get, > after splitting the values of a CSV string. > It would be great, if the csv() method would also accept such a Dataset as > input especially given a target schema. The csv parser usually casts the > separated values against the schema and can sort out lines where the values > of the columns do not fit with the schema. > This is especially interesting with PERMISSIVE mode and a column for corrupt > records which then should contain the input list of strings as a dumped JSON > string. > This is the functionality I am looking for and I think it is already > implemented in the CSV parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method
[ https://issues.apache.org/jira/browse/SPARK-25339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604738#comment-16604738 ] Dongjoon Hyun commented on SPARK-25339: --- Thank you for filing this so that it doesn't get forgotten. I'm okay with it. If you want, you can work on this, [~yumwang]. > Refactor FilterPushdownBenchmark to use main method > --- > > Key: SPARK-25339 > URL: https://issues.apache.org/jira/browse/SPARK-25339 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuming Wang >Priority: Major > > Wenchen commented on the PR: > https://github.com/apache/spark/pull/22336#issuecomment-418604019 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]
Frank Kemmer created SPARK-25343: Summary: Extend CSV parsing to Dataset[List[String]] Key: SPARK-25343 URL: https://issues.apache.org/jira/browse/SPARK-25343 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.1 Reporter: Frank Kemmer With the cvs() method it is currenty possible to create a Dataframe from Dataset[String], where the given string contains comma separated values. This is really great. But very often we have to parse files where we have to split the values of a line by very individual value separators and regular expressions. The result is a Dataset[List[String]]. This list corresponds to what you would get, after splitting the values of a CSV string. It would be great, if the csv() method would also accept such a Dataset as input especially given a target schema. The csv parser usually casts the separated values against the schema and can sort out lines where the values of the columns do not fit with the schema. This is the functionality I am looking for and I think it is already implemented in the CSV parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25342) Support rolling back a result stage
Wenchen Fan created SPARK-25342: --- Summary: Support rolling back a result stage Key: SPARK-25342 URL: https://issues.apache.org/jira/browse/SPARK-25342 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Wenchen Fan This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243 To completely fix that problem, Spark needs to be able to roll back a result stage and rerun all the result tasks. However, the result stage may do file committing, which currently does not support re-committing a task. We should either support rolling back a committed task, or abort the entire commit and do it again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25341) Support rolling back a shuffle map stage and re-generate the shuffle files
Wenchen Fan created SPARK-25341: --- Summary: Support rolling back a shuffle map stage and re-generate the shuffle files Key: SPARK-25341 URL: https://issues.apache.org/jira/browse/SPARK-25341 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: Wenchen Fan This is a follow-up of https://issues.apache.org/jira/browse/SPARK-23243 To completely fix that problem, Spark needs to be able to roll back a shuffle map stage and rerun all the map tasks. According to https://github.com/apache/spark/pull/9214 , Spark doesn't support it currently, as in shuffle writing "first write wins". Since overwriting shuffle files is hard, we can extend the shuffle id to include a "shuffle generation number". Then the reduce task can specify which generation of shuffle it wants to read. https://github.com/apache/spark/pull/6648 seems to be in the right direction. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress
[ https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24748: Assignee: (was: Apache Spark) > Support for reporting custom metrics via Streaming Query Progress > - > > Key: SPARK-24748 > URL: https://issues.apache.org/jira/browse/SPARK-24748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Arun Mahadevan >Priority: Major > > Currently the Structured Streaming sources and sinks do not have a way to > report custom metrics. Providing an option to report custom metrics and > making it available via Streaming Query progress can enable sources and sinks > to report custom progress information (e.g. the lag metrics for the Kafka source). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress
[ https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24748: Assignee: Apache Spark > Support for reporting custom metrics via Streaming Query Progress > - > > Key: SPARK-24748 > URL: https://issues.apache.org/jira/browse/SPARK-24748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Arun Mahadevan >Assignee: Apache Spark >Priority: Major > > Currently the Structured Streaming sources and sinks do not have a way to > report custom metrics. Providing an option to report custom metrics and > making it available via Streaming Query progress can enable sources and sinks > to report custom progress information (e.g. the lag metrics for the Kafka source). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24539) HistoryServer does not display metrics from tasks that complete after stage failure
[ https://issues.apache.org/jira/browse/SPARK-24539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Gupta resolved SPARK-24539. - Resolution: Duplicate Resolving this as it has been fixed by SPARK-24415. > HistoryServer does not display metrics from tasks that complete after stage > failure > --- > > Key: SPARK-24539 > URL: https://issues.apache.org/jira/browse/SPARK-24539 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.1 >Reporter: Imran Rashid >Priority: Major > > I noticed that task metrics for completed tasks with a stage failure do not > show up in the new history server. I have a feeling this is because all of > the tasks succeeded *after* the stage had been failed (so they were > completions from a "zombie" taskset). The task metrics (eg. the shuffle read > size & shuffle write size) do not show up at all, either in the task table, > the executor table, or the overall stage summary metrics. (they might not > show up in the job summary page either, but in the event logs I have, there > is another successful stage attempt after this one, and that is the only > thing which shows up in the jobs page.) If you get task details from the api > endpoint (eg. > http://[host]:[port]/api/v1/applications/[app-id]/stages/[stage-id]/[stage-attempt]) > then you can see the successful tasks and all the metrics > Unfortunately the event logs I have are huge and I don't have a small repro > handy, but I hope that description is enough to go on. > I loaded the event logs I have in the SHS from spark 2.2 and they appear fine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
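The REST endpoint mentioned in the description can be queried directly, which is handy for confirming that the metrics exist even when the stage page omits them. A small sketch; host, port, application id, stage id and attempt are placeholders:

{code:scala}
import scala.io.Source

// 18080 is the default history server port; substitute your own application/stage ids.
val url = "http://localhost:18080/api/v1/applications/app-20180905120000-0000/stages/3/0"
val json = Source.fromURL(url).mkString
println(json) // successful task attempts and their metrics appear here
{code}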
[jira] [Assigned] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24415: -- Assignee: Ankur Gupta > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Assignee: Ankur Gupta >Priority: Critical > Fix For: 2.4.0 > > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24415) Stage page aggregated executor metrics wrong when failures
[ https://issues.apache.org/jira/browse/SPARK-24415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24415. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22209 [https://github.com/apache/spark/pull/22209] > Stage page aggregated executor metrics wrong when failures > --- > > Key: SPARK-24415 > URL: https://issues.apache.org/jira/browse/SPARK-24415 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Thomas Graves >Priority: Critical > Fix For: 2.4.0 > > Attachments: Screen Shot 2018-05-29 at 2.15.38 PM.png > > > Running with spark 2.3 on yarn and having task failures and blacklisting, the > aggregated metrics by executor are not correct. In my example it should have > 2 failed tasks but it only shows one. Note I tested with master branch to > verify its not fixed. > I will attach screen shot. > To reproduce: > $SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client > --executor-memory=2G --num-executors=1 --conf "spark.blacklist.enabled=true" > --conf "spark.blacklist.stage.maxFailedTasksPerExecutor=1" --conf > "spark.blacklist.stage.maxFailedExecutorsPerNode=1" --conf > "spark.blacklist.application.maxFailedTasksPerExecutor=2" --conf > "spark.blacklist.killBlacklistedExecutors=true" > import org.apache.spark.SparkEnv > sc.parallelize(1 to 1, 10).map \{ x => if (SparkEnv.get.executorId.toInt > >= 1 && SparkEnv.get.executorId.toInt <= 4) throw new RuntimeException("Bad > executor") else (x % 3, x) }.reduceByKey((a, b) => a + b).collect() -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604618#comment-16604618 ] Apache Spark commented on SPARK-14922: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20999 > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.1 >Reporter: Xiao Li >Priority: Major > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
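For contrast, a quick sketch of the current behaviour (table and values taken from the Hive test above): equality-based partition specs already work in Spark SQL, while the comparison-based spec is the missing piece this issue asks for:

{code:scala}
// Supported today: dropping a partition by exact values.
spark.sql("ALTER TABLE ptestfilter DROP PARTITION (c='US', d='2')")

// Works in Hive but not in Spark at the time of this issue:
// spark.sql("ALTER TABLE ptestfilter DROP PARTITION (c='US', d<'2')")
{code}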
[jira] [Resolved] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhichao Zhang resolved SPARK-25279. Resolution: Won't Fix > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > 
Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2, > > name: $outer, type: class > org.apache.spark.sql.execution
[jira] [Closed] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhichao Zhang closed SPARK-25279. -- > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS 
sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregate
[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604604#comment-16604604 ] Zhichao Zhang commented on SPARK-25279: [~viirya], Thanks. I closed this issue. > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > 
StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfu
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604532#comment-16604532 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/22343 > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25132) Case-insensitive field resolution when reading from Parquet
[ https://issues.apache.org/jira/browse/SPARK-25132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604529#comment-16604529 ] Apache Spark commented on SPARK-25132: -- User 'seancxmao' has created a pull request for this issue: https://github.com/apache/spark/pull/22343 > Case-insensitive field resolution when reading from Parquet > --- > > Key: SPARK-25132 > URL: https://issues.apache.org/jira/browse/SPARK-25132 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 2.3.1 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Labels: Parquet > Fix For: 2.4.0 > > > Spark SQL returns NULL for a column whose Hive metastore schema and Parquet > schema are in different letter cases, regardless of spark.sql.caseSensitive > set to true or false. > Here is a simple example to reproduce this issue: > scala> spark.range(5).toDF.write.mode("overwrite").saveAsTable("t1") > spark-sql> show create table t1; > CREATE TABLE `t1` (`id` BIGINT) > USING parquet > OPTIONS ( > `serialization.format` '1' > ) > spark-sql> CREATE TABLE `t2` (`ID` BIGINT) > > USING parquet > > LOCATION 'hdfs://localhost/user/hive/warehouse/t1'; > spark-sql> select * from t1; > 0 > 1 > 2 > 3 > 4 > spark-sql> select * from t2; > NULL > NULL > NULL > NULL > NULL > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604441#comment-16604441 ] Ameen Tayyebi commented on SPARK-23443: --- I've been sidetracked with lots of other projects, so at this time, I don't have bandwidth to work on this unfortunately :( :( > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25228) Add executor CPU Time metric
[ https://issues.apache.org/jira/browse/SPARK-25228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25228. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22218 [https://github.com/apache/spark/pull/22218] > Add executor CPU Time metric > - > > Key: SPARK-25228 > URL: https://issues.apache.org/jira/browse/SPARK-25228 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 2.4.0 > > Attachments: Spark_Metric_executorCPUTIme_Grafana_dashboard.PNG > > > I propose to add a new metric to measure the executor's process CPU time. > This allows implementing monitoring of CPU resources used by Spark for > example using a Grafana dashboard, as in the attached example screenshot. > Note: this is similar and builds on top of the work in SPARK-22190. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25228) Add executor CPU Time metric
[ https://issues.apache.org/jira/browse/SPARK-25228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25228: - Assignee: Luca Canali > Add executor CPU Time metric > - > > Key: SPARK-25228 > URL: https://issues.apache.org/jira/browse/SPARK-25228 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 2.4.0 > > Attachments: Spark_Metric_executorCPUTIme_Grafana_dashboard.PNG > > > I propose to add a new metric to measure the executor's process CPU time. > This allows implementing monitoring of CPU resources used by Spark for > example using a Grafana dashboard, as in the attached example screenshot. > Note: this is similar and builds on top of the work in SPARK-22190. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
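To chart a metric like this in Grafana, the metrics system needs a sink that the dashboard can read from; Graphite is a common choice. A sketch of wiring a Graphite sink purely through Spark conf, assuming the metrics system picks up spark.metrics.conf.* entries and that a Carbon endpoint is reachable at the placeholder host/port:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("executor-metrics-demo")
  // Everything below is equivalent to entries in metrics.properties.
  .config("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .config("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com") // placeholder
  .config("spark.metrics.conf.*.sink.graphite.port", "2003")                 // placeholder
  .config("spark.metrics.conf.*.sink.graphite.period", "10")
  .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
  .getOrCreate()
{code}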
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604391#comment-16604391 ] Sean Owen commented on SPARK-18112: --- I don't know much about this part, but do we need Hive 2.x on the Spark (client) side in order to read from Hive 2.x metastore? Are you including Hive 2.x in your app? I don't know if that works. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25340) Pushes down Sample beneath deterministic Project
[ https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-25340: - Description: If computations in Project are heavy (e.g., UDFs), it is useful to push down sample nodes into deterministic projects; {code} scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) // without this proposal == Analyzed Logical Plan == (id + 3): bigint Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) == Optimized Logical Plan == Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + 3) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) // with this proposal == Optimized Logical Plan == Project [(id#0L + 3) AS (id + 3)#2L] +- Sample 0.0, 0.5, false, -6519017078291024113 +- Range (0, 10, step=1, splits=Some(4)) {code} POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown was: If computations in Project are heavy (e.g., UDFs), it is useful to push down sample nodes into deterministic projects; {code} scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) // without this proposal == Analyzed Logical Plan == (id + 3): bigint Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) == Optimized Logical Plan == Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + 3) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) // with this proposal == Optimized Logical Plan == Project [(id#0L + 3) AS (id + 3)#2L] +- Sample 0.0, 0.5, false, -6519017078291024113 +- Range (0, 10, step=1, splits=Some(4)) {code} > Pushes down Sample beneath deterministic Project > > > Key: SPARK-25340 > URL: https://issues.apache.org/jira/browse/SPARK-25340 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > If computations in Project are heavy (e.g., UDFs), it is useful to push down > sample nodes into deterministic projects; > {code} > scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) > // without this proposal > == Analyzed Logical Plan == > (id + 3): bigint > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > == Optimized Logical Plan == > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + 3) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > // with this proposal > == Optimized Logical Plan == > Project [(id#0L + 3) AS (id + 3)#2L] > +- Sample 0.0, 0.5, false, -6519017078291024113 >+- Range (0, 10, step=1, splits=Some(4)) > {code} > POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25340) Pushes down Sample beneath deterministic Project
[ https://issues.apache.org/jira/browse/SPARK-25340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604390#comment-16604390 ] Takeshi Yamamuro commented on SPARK-25340: -- Is this feasible? [~smilegator] > Pushes down Sample beneath deterministic Project > > > Key: SPARK-25340 > URL: https://issues.apache.org/jira/browse/SPARK-25340 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > If computations in Project are heavy (e.g., UDFs), it is useful to push down > sample nodes into deterministic projects; > {code} > scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) > // without this proposal > == Analyzed Logical Plan == > (id + 3): bigint > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > == Optimized Logical Plan == > Sample 0.0, 0.5, false, 3370873312340343855 > +- Project [(id#0L + 3) AS (id + 3)#2L] >+- Range (0, 10, step=1, splits=Some(4)) > // with this proposal > == Optimized Logical Plan == > Project [(id#0L + 3) AS (id + 3)#2L] > +- Sample 0.0, 0.5, false, -6519017078291024113 >+- Range (0, 10, step=1, splits=Some(4)) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25340) Pushes down Sample beneath deterministic Project
Takeshi Yamamuro created SPARK-25340: Summary: Pushes down Sample beneath deterministic Project Key: SPARK-25340 URL: https://issues.apache.org/jira/browse/SPARK-25340 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.3.1 Reporter: Takeshi Yamamuro If computations in Project are heavy (e.g., UDFs), it is useful to push down sample nodes into deterministic projects; {code} scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true) // without this proposal == Analyzed Logical Plan == (id + 3): bigint Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) == Optimized Logical Plan == Sample 0.0, 0.5, false, 3370873312340343855 +- Project [(id#0L + 3) AS (id + 3)#2L] +- Range (0, 10, step=1, splits=Some(4)) // with this proposal == Optimized Logical Plan == Project [(id#0L + 3) AS (id + 3)#2L] +- Sample 0.0, 0.5, false, -6519017078291024113 +- Range (0, 10, step=1, splits=Some(4)) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
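A rough sketch of what the rewrite could look like as a Catalyst rule; the linked POC branch is the authoritative version, and the pattern below deliberately avoids depending on Sample's exact constructor:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project, Sample}
import org.apache.spark.sql.catalyst.rules.Rule

// Push Sample below a Project whose expressions are all deterministic, so heavy
// projections (e.g. UDFs) are evaluated only on the sampled rows.
object PushSampleThroughProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case s: Sample => s.child match {
      case p @ Project(projectList, grandChild) if projectList.forall(_.deterministic) =>
        p.copy(child = s.withNewChildren(Seq(grandChild)))
      case _ => s
    }
  }
}
{code}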
[jira] [Commented] (SPARK-23443) Spark with Glue as external catalog
[ https://issues.apache.org/jira/browse/SPARK-23443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604331#comment-16604331 ] t oo commented on SPARK-23443: -- [~ameen.tayy...@gmail.com] any luck with the first PR? > Spark with Glue as external catalog > --- > > Key: SPARK-23443 > URL: https://issues.apache.org/jira/browse/SPARK-23443 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Ameen Tayyebi >Priority: Major > > AWS Glue Catalog is an external Hive metastore backed by a web service. It > allows permanent storage of catalog data for BigData use cases. > To find out more information about AWS Glue, please consult: > * AWS Glue - [https://aws.amazon.com/glue/] > * Using Glue as a Metastore catalog for Spark - > [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html] > Today, the integration of Glue and Spark is through the Hive layer. Glue > implements the IMetaStore interface of Hive and for installations of Spark > that contain Hive, Glue can be used as the metastore. > The feature set that Glue supports does not align 1-1 with the set of > features that the latest version of Spark supports. For example, Glue > interface supports more advanced partition pruning than the latest version of > Hive embedded in Spark. > To enable a more natural integration with Spark and to allow leveraging > latest features of Glue, without being coupled to Hive, a direct integration > through Spark's own Catalog API is proposed. This Jira tracks this work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence
[ https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604298#comment-16604298 ] Mathew commented on SPARK-24632: [~bryanc] that line is only there because we use the Java object name to get the name of the Python object to read; it is the bane of my life when developing external transformer packages and enabling them to support pipeline persistence. > Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers > for persistence > -- > > Key: SPARK-24632 > URL: https://issues.apache.org/jira/browse/SPARK-24632 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Priority: Major > > This is a follow-up for [SPARK-17025], which allowed users to implement > Python PipelineStages in 3rd-party libraries, include them in Pipelines, and > use Pipeline persistence. This task is to make it easier for 3rd-party > libraries to have PipelineStages written in Java and then to use pyspark.ml > abstractions to create wrappers around those Java classes. This is currently > possible, except that users hit bugs around persistence. > I spent a bit thinking about this and wrote up thoughts and a proposal in the > doc linked below. Summary of proposal: > Require that 3rd-party libraries with Java classes with Python wrappers > implement a trait which provides the corresponding Python classpath in some > field: > {code} > trait PythonWrappable { > def pythonClassPath: String = … > } > MyJavaType extends PythonWrappable > {code} > This will not be required for MLlib wrappers, which we can handle specially. > One issue for this task will be that we may have trouble writing unit tests. > They would ideally test a Java class + Python wrapper class pair sitting > outside of pyspark. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24360) Support Hive 3.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-24360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604292#comment-16604292 ] t oo commented on SPARK-24360: -- [~dongjoon] Can this be merged to master? Also, can hive3.1 support be added easily? > Support Hive 3.0 metastore > -- > > Key: SPARK-24360 > URL: https://issues.apache.org/jira/browse/SPARK-24360 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 3.0.0 is released. This issue aims to support Hive Metastore 3.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604254#comment-16604254 ] Liang-Chi Hsieh edited comment on SPARK-25279 at 9/5/18 10:34 AM: -- The paste mode in REPL wraps pasted code as a single object and so the `TypedColumn` object is wrapped together. `TypedColumn` is not serializable. Seems to me this shouldn't be as a bug in Spark. was (Author: viirya): The paste mode in REPL wraps pasted code as a single object and so the `TypedColumn` object is wrapped together. `TypedColumn` is not serializable. > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > 
aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.Ob
[jira] [Commented] (SPARK-25279) Throw exception: zzcclp java.io.NotSerializableException: org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc
[ https://issues.apache.org/jira/browse/SPARK-25279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604254#comment-16604254 ] Liang-Chi Hsieh commented on SPARK-25279: - The paste mode in REPL wraps pasted code as a single object and so the `TypedColumn` object is wrapped together. `TypedColumn` is not serializable. > Throw exception: zzcclp java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn in Spark-shell when run example of doc > --- > > Key: SPARK-25279 > URL: https://issues.apache.org/jira/browse/SPARK-25279 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.2.1 >Reporter: Zhichao Zhang >Priority: Minor > > Hi dev: > I am using Spark-Shell to run the example which is in section > '[http://spark.apache.org/docs/2.2.2/sql-programming-guide.html#type-safe-user-defined-aggregate-functions'], > > and there is an error: > {code:java} > Caused by: java.io.NotSerializableException: > org.apache.spark.sql.TypedColumn > Serialization stack: > - object not serializable (class: org.apache.spark.sql.TypedColumn, > value: > myaverage() AS `average_salary`) > - field (class: $iw, name: averageSalary, type: class > org.apache.spark.sql.TypedColumn) > - object (class $iw, $iw@4b2f8ae9) > - field (class: MyAverage$, name: $outer, type: class $iw) > - object (class MyAverage$, MyAverage$@2be41d90) > - field (class: > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > name: aggregator, type: class org.apache.spark.sql.expressions.Aggregator) > - object (class > org.apache.spark.sql.execution.aggregate.ComplexTypedAggregateExpression, > MyAverage(Employee)) > - field (class: > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > name: aggregateFunction, type: class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateFunction) > - object (class > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression, > partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class Employee)), > Some(class Employee), Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)) > - writeObject data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.List$SerializationProxy, > scala.collection.immutable.List$SerializationProxy@5e92c46f) > - writeReplace data (class: > scala.collection.immutable.List$SerializationProxy) > - object (class scala.collection.immutable.$colon$colon, > List(partial_myaverage(MyAverage$@2be41d90, Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0))) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, name: > aggregateExpressions, type: interface scala.collection.Seq) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec, > ObjectHashAggregate(keys=[], > functions=[partial_myaverage(MyAverage$@2be41d90, 
Some(newInstance(class > Employee)), Some(class Employee), > Some(StructType(StructField(name,StringType,true), > StructField(salary,LongType,false))), assertnotnull(assertnotnull(input[0, > Average, true])).sum AS sum#25L, assertnotnull(assertnotnull(input[0, > Average, true])).count AS count#26L, newInstance(class Average), input[0, > double, false] AS value#24, DoubleType, false, 0, 0)], output=[buf#37]) > +- *FileScan json [name#8,salary#9L] Batched: false, Format: JSON, Location: > InMemoryFileIndex[file:/opt/spark2/examples/src/main/resources/employees.json], > > PartitionFilters: [], PushedFilters: [], ReadSchema: > struct > ) > - field (class: > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > name: $outer, type: class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec) > - object (class > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1, > > ) >
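For readers hitting this, a self-contained version of the documentation example (the type-safe Aggregator) that avoids the trap described above: build the TypedColumn at the call site instead of parking it in a REPL val that the task closure then drags in. The file path matches the docs example; adjust as needed:

{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  def zero: Average = Average(0L, 0L)
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary; buffer.count += 1; buffer
  }
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum; b1.count += b2.count; b1
  }
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val spark = SparkSession.builder().appName("typed-udaf").getOrCreate()
import spark.implicits._

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
// Building the TypedColumn inline keeps it out of the REPL's wrapper object:
ds.select(MyAverage.toColumn.name("average_salary")).show()
{code}

Packaging the aggregator in a jar on the classpath (instead of pasting it into the shell) also sidesteps the wrapping entirely.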
[jira] [Assigned] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats
[ https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24889: Assignee: Apache Spark > dataset.unpersist() doesn't update storage memory stats > --- > > Key: SPARK-24889 > URL: https://issues.apache.org/jira/browse/SPARK-24889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuri Bogomolov >Assignee: Apache Spark >Priority: Major > Attachments: image-2018-07-23-10-53-58-474.png > > > Steps to reproduce: > 1) Start a Spark cluster, and check the storage memory value from the Spark > Web UI "Executors" tab (it should be equal to zero if you just started) > 2) Run: > {code:java} > val df = spark.sqlContext.range(1, 10) > df.cache() > df.count() > df.unpersist(true){code} > 3) Check the storage memory value again, now it's equal to 1GB > > Looks like the memory is actually released, but stats aren't updated. This > issue makes cluster management more complicated. > !image-2018-07-23-10-53-58-474.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats
[ https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24889: Assignee: (was: Apache Spark) > dataset.unpersist() doesn't update storage memory stats > --- > > Key: SPARK-24889 > URL: https://issues.apache.org/jira/browse/SPARK-24889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuri Bogomolov >Priority: Major > Attachments: image-2018-07-23-10-53-58-474.png > > > Steps to reproduce: > 1) Start a Spark cluster, and check the storage memory value from the Spark > Web UI "Executors" tab (it should be equal to zero if you just started) > 2) Run: > {code:java} > val df = spark.sqlContext.range(1, 10) > df.cache() > df.count() > df.unpersist(true){code} > 3) Check the storage memory value again, now it's equal to 1GB > > Looks like the memory is actually released, but stats aren't updated. This > issue makes cluster management more complicated. > !image-2018-07-23-10-53-58-474.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24889) dataset.unpersist() doesn't update storage memory stats
[ https://issues.apache.org/jira/browse/SPARK-24889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604225#comment-16604225 ] Apache Spark commented on SPARK-24889: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22341 > dataset.unpersist() doesn't update storage memory stats > --- > > Key: SPARK-24889 > URL: https://issues.apache.org/jira/browse/SPARK-24889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Yuri Bogomolov >Priority: Major > Attachments: image-2018-07-23-10-53-58-474.png > > > Steps to reproduce: > 1) Start a Spark cluster, and check the storage memory value from the Spark > Web UI "Executors" tab (it should be equal to zero if you just started) > 2) Run: > {code:java} > val df = spark.sqlContext.range(1, 10) > df.cache() > df.count() > df.unpersist(true){code} > 3) Check the storage memory value again, now it's equal to 1GB > > Looks like the memory is actually released, but stats aren't updated. This > issue makes cluster management more complicated. > !image-2018-07-23-10-53-58-474.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
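The storage memory tracked by the block manager master can also be inspected programmatically, which makes the stale-UI symptom easy to demonstrate: these live numbers may disagree with the Executors tab once its stats go stale. A small sketch:

{code:scala}
// Per block manager (host:port): (maximum storage memory, remaining storage memory).
spark.sparkContext.getExecutorMemoryStatus.foreach { case (hostPort, (maxMem, remainingMem)) =>
  println(s"$hostPort used=${maxMem - remainingMem} max=$maxMem")
}
{code}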
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604176#comment-16604176 ] Hyukjin Kwon commented on SPARK-18112: -- Can you post reproducer step by step? did you set {{spark.sql.hive.metastore.version}} and jar properly? > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive2.0 has been released in February 2016, after that Hive2.0.1 and > Hive2.1.0 have also been released for a long time, but till now spark only > support to read hive metastore data from Hive1.2.1 and older version, since > Hive2.x has many bugs fixed and performance improvement it's better and > urgent to upgrade to support Hive2.x > failed to load data from hive2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
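For reference, the two settings mentioned above are normally supplied before the session first touches Hive. A sketch; the version string and jar path are placeholders for whatever Hive 2.x build is actually installed:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-2x-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.1")            // placeholder version
  .config("spark.sql.hive.metastore.jars", "/path/to/hive/lib/*") // placeholder classpath
  .enableHiveSupport()
  .getOrCreate()
{code}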
[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore
[ https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604167#comment-16604167 ] t oo commented on SPARK-18112: -- [~hyukjin.kwon] [~srowen] Can this ticket be re-opened? This code is still in master, as mentioned in the comments above. > Spark2.x does not support read data from Hive 2.x metastore > --- > > Key: SPARK-18112 > URL: https://issues.apache.org/jira/browse/SPARK-18112 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: KaiXu >Assignee: Xiao Li >Priority: Critical > Fix For: 2.2.0 > > > Hive 2.0 was released in February 2016, and Hive 2.0.1 and > Hive 2.1.0 have also been available for a long time, but so far Spark only > supports reading Hive metastore data from Hive 1.2.1 and older. Since > Hive 2.x has many bug fixes and performance improvements, it is better and > urgent to upgrade to support Hive 2.x. > Failure when loading data from a Hive 2.x metastore: > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4 > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45) > at > org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48) > at > org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568) > at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13446) Spark need to support reading data from Hive 2.0.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-13446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604162#comment-16604162 ] t oo commented on SPARK-13446: -- [~cloud_fan] I am hitting the same issue as [~elgalu] :( > Spark need to support reading data from Hive 2.0.0 metastore > > > Key: SPARK-13446 > URL: https://issues.apache.org/jira/browse/SPARK-13446 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Lifeng Wang >Assignee: Xiao Li >Priority: Major > Fix For: 2.2.0 > > > Spark provides the HiveContext class to read data from the Hive metastore directly, > but it only supports Hive 1.2.1 and older. Since Hive 2.0.0 has been > released, it would be better to upgrade to support Hive 2.0.0. > {noformat} > 16/02/23 02:35:02 INFO metastore: Trying to connect to metastore with URI > thrift://hsw-node13:9083 > 16/02/23 02:35:02 INFO metastore: Opened a connection to metastore, current > connections: 1 > 16/02/23 02:35:02 INFO metastore: Connected to metastore. > Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT > at > org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:473) > at > org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:192) > at > org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185) > at > org.apache.spark.sql.hive.HiveContext$$anon$1.<init>(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:422) > at > org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:421) > at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:72) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:739) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:735) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604155#comment-16604155 ] Apache Spark commented on SPARK-17159: -- User 'ScrapCodes' has created a pull request for this issue: https://github.com/apache/spark/pull/22339 > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: DStreams >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, then takes the output and does listStatus() > on it. > This is going to suffer on object stores, as directory listing and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
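To make the cost concrete, here is a sketch (Hadoop {{FileSystem}} API; the path and time window are hypothetical) of the pattern the issue describes, a glob whose filter issues one {{getFileStatus()}} call per path, next to a single listing that filters on the metadata it already returns. The extra round trip per file is what hurts on object stores.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path, PathFilter}

val dir = new Path("s3a://some-bucket/incoming")         // placeholder location
val fs: FileSystem = dir.getFileSystem(new Configuration())
val windowStartMs = System.currentTimeMillis() - 60000L  // hypothetical batch window

// Pattern described in the issue: the filter triggers an extra
// getFileStatus() call for every path the glob matches.
val viaPerFileStatus: Array[FileStatus] = fs.globStatus(
  new Path(dir, "*"),
  new PathFilter {
    override def accept(path: Path): Boolean =
      fs.getFileStatus(path).getModificationTime >= windowStartMs
  })

// Cheaper shape: one listStatus() call, then filter on the FileStatus
// objects that listing already carries, with no per-file round trips.
val viaSingleListing: Array[FileStatus] = fs
  .listStatus(dir)
  .filter(st => !st.isDirectory && st.getModificationTime >= windowStartMs)
{code}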
[jira] [Commented] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasou
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604154#comment-16604154 ] Apache Spark commented on SPARK-25337: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22340 > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25337: Assignee: Apache Spark > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasour
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25337: Assignee: (was: Apache Spark) > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16604146#comment-16604146 ] Apache Spark commented on SPARK-25317: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/22338 > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
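If anyone wants to rerun the comparison without the copy-pasted calls, the same micro-benchmark can be written with an inner loop. This is only a restatement of the snippet above (set the inner count to the number of unrolled calls there), and the extra loop counter adds a little overhead compared to the fully unrolled version.
{code:scala}
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val callsPerIter = 30  // set to the number of unrolled calls in the snippet above

val start = System.nanoTime()
var i = 0
while (i < numIter) {
  var j = 0
  while (j < callsPerIter) {
    Murmur3_x86_32.hashUTF8String(str, 0)  // needs the 2.3 shim shown in the description
    j += 1
  }
  i += 1
}
println(s"duration ${(System.nanoTime() - start) / 1000 / numIter} us")
{code}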
[jira] [Assigned] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25317: Assignee: Apache Spark > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org