[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033421#comment-17033421 ] Frederik Schreiber commented on SPARK-30711: Hey, here is my setting: spark.sql.codegen.fallback: true (default) So yes, the fallback works for me, but I'm wondering why the exception is printed as an error message with a full stack trace if that is a regular path in the code. Shouldn't that be a warning instead of an error message? So can I ignore this error message completely? And how can I avoid the bytecode growing past 64 KB so quickly? thanks, Frederik > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at >
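Stepping back from the specific report: when the fallback is enabled, the ERROR plus stack trace is logged but the query proceeds in interpreted (non-codegen) mode, so the message is noisy rather than fatal. The settings involved can be sketched as a spark-defaults.conf fragment; note that `spark.sql.codegen.hugeMethodLimit` is an internal option added around Spark 2.3, so verify the name and default against your Spark version before relying on it:

```
# spark-defaults.conf sketch -- verify names/defaults against your Spark version

# Fall back to interpreted evaluation when generated code fails to compile
# (this is why the query still succeeds after the ERROR is logged).
spark.sql.codegen.fallback        true

# Since ~Spark 2.3: skip whole-stage codegen up front when a generated method
# would exceed this many bytes of bytecode (the JVM hard cap is 65535).
spark.sql.codegen.hugeMethodLimit 65535
```

As for keeping the generated bytecode small in the first place: method size roughly tracks how many expressions get fused into one whole-stage pipeline, so breaking a very long chain of withColumn/select/join steps with a checkpoint() or an intermediate write is a common way to stay under the limit.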
[jira] [Created] (SPARK-30769) insertInto() with existing column as partition key cause weird partition result
Woong Seok Kang created SPARK-30769: --- Summary: insertInto() with existing column as partition key cause weird partition result Key: SPARK-30769 URL: https://issues.apache.org/jira/browse/SPARK-30769 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Environment: EMR 5.29.0 with Spark 2.4.4 Reporter: Woong Seok Kang {code:java} val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned" val writer = TableWriter.getWriter(tableDF.withColumn(config.dateColumn, typedLit[String](date.toString))) if (xsc.tableExistIn(config.service, saveDatabase, s"${config.table}_partitioned")) writer.insertInto(tableName) else writer.partitionBy(config.dateColumn).saveAsTable(tableName){code} This code checks whether the table exists in the desired path (somewhere in S3 in this case). If the table already exists in that path, it inserts a new partition with the insertInto() function. If config.dateColumn does not exist in the table schema, no problem occurs (the new column is simply added), but if it already exists in the schema, Spark does not use the given column as the partition key; instead it creates hundreds of partitions. Below is a part of the Spark logs. (Note that the name of the partition column is date_ymd, which already exists in the source table; the 
original value is a date string like '2020-01-01'.) 20/02/10 05:33:01 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174 s3://\{my_path_at_s3}_partitioned_test/date_ymd=174 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62 s3://\{my_path_at_s3}_partitioned_test/date_ymd=62 20/02/10 05:33:02 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83 s3://\{my_path_at_s3}_partitioned_test/date_ymd=83 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231 s3://\{my_path_at_s3}_partitioned_test/date_ymd=231 20/02/10 05:33:03 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268 s3://\{my_path_at_s3}_partitioned_test/date_ymd=268 20/02/10 05:33:04 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33 s3://\{my_path_at_s3}_partitioned_test/date_ymd=33 20/02/10 05:33:05 INFO S3NativeFileSystem2: rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40 s3://\{my_path_at_s3}_partitioned_test/date_ymd=40 rename s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__ s3://\{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__ When I use a different partition key that is not in the table schema, such as 'stamp_date', everything goes fine. I'm not sure whether this is a Spark bug; I just wrote the report. (I think it is related to Hive...) Thanks for reading! 
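For what it's worth, the behavior described above is consistent with insertInto() resolving columns by position rather than by name: saveAsTable with partitionBy places the partition column last in the table schema, so if the incoming frame has date_ymd in a different slot, an unrelated column's values land in the partition key (hence partitions like date_ymd=174). A minimal sketch of the usual workaround, using a hypothetical helper over plain column-name lists rather than real DataFrames:

```python
def align_columns(df_columns, table_columns):
    """Return the column order to select() before insertInto().

    insertInto() matches columns by position, not by name, so the
    DataFrame must be reordered to the target table's schema, with
    the partition column in the table's (last) slot.
    """
    missing = set(table_columns) - set(df_columns)
    if missing:
        raise ValueError(f"DataFrame is missing columns: {sorted(missing)}")
    return list(table_columns)

# The table was created with partitionBy("date_ymd"), which puts the
# partition column last in the table schema:
table_schema = ["item_id", "value", "date_ymd"]
df_schema = ["date_ymd", "item_id", "value"]  # partition column first

aligned = align_columns(df_schema, table_schema)
print(aligned)  # ['item_id', 'value', 'date_ymd']
```

In real code this would be applied as df.select(*aligned).write.insertInto(tableName), so the positional write lines up with the table's partition column.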
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30768) Constraints should be inferred from inequality attributes
[ https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30768: Description: How to reproduce: {code:sql} create table SPARK_30768_1(c1 int, c2 int); create table SPARK_30768_2(c1 int, c2 int); {code} *Spark SQL*: {noformat} spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on (t1.c1 > t2.c1) where t1.c1 = 3; == Physical Plan == *(3) Project [c1#5, c2#6] +- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7) :- *(1) Project [c1#5, c2#6] : +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3)) : +- *(1) ColumnarToRow :+- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: true, DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], ReadSchema: struct +- BroadcastExchange IdentityBroadcastMode, [id=#60] +- *(2) Project [c1#7] +- *(2) Filter isnotnull(c1#7) +- *(2) ColumnarToRow +- FileScan parquet default.spark_30768_2[c1#7] Batched: true, DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct {noformat} *Hive* supports this feature: {noformat} hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on (t1.c1 > t2.c1) where t1.c1 = 3; Warning: Map Join MAPJOIN[13][bigTable=?] 
in task 'Stage-3:MAPRED' is a cross product OK STAGE DEPENDENCIES: Stage-4 is a root stage Stage-3 depends on stages: Stage-4 Stage-0 depends on stages: Stage-3 STAGE PLANS: Stage: Stage-4 Map Reduce Local Work Alias -> Map Local Tables: $hdt$_0:t1 Fetch Operator limit: -1 Alias -> Map Local Operator Tree: $hdt$_0:t1 TableScan alias: t1 filterExpr: (c1 = 3) (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (c1 = 3) (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: c2 (type: int) outputColumnNames: _col1 Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE HashTable Sink Operator keys: 0 1 Stage: Stage-3 Map Reduce Map Operator Tree: TableScan alias: t2 filterExpr: (c1 < 3) (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Filter Operator predicate: (c1 < 3) (type: boolean) Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Select Operator Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 keys: 0 1 outputColumnNames: _col1 Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL Column stats: NONE Select Operator expressions: 3 (type: int), _col1 (type: int) outputColumnNames: _col0, _col1 Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Execution mode: vectorized Local Work: Map Reduce Local Work Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink Time taken: 5.491 seconds, Fetched: 71 row(s) {noformat} > 
Constraints should be inferred from inequality attributes > - > > Key: SPARK-30768 > URL: https://issues.apache.org/jira/browse/SPARK-30768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority:
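The inference being requested above — combining t1.c1 = 3 with the join condition t1.c1 > t2.c1 to push t2.c1 < 3 down to the second scan, as the Hive plan's filterExpr: (c1 < 3) shows — can be sketched over a toy constraint representation (the tuple encoding here is hypothetical, not Spark's internal Constraint classes):

```python
def infer_from_inequality(constraints):
    """Derive new constraints by substituting known equalities into
    inequalities (the optimizer rule this ticket asks for, in miniature).

    ('eq', attr, value)  means  attr = value
    ('gt', a, b)         means  a > b   (both attribute names)
    """
    values = {attr: v for kind, attr, v in constraints if kind == 'eq'}
    derived = []
    for kind, left, right in constraints:
        if kind == 'gt' and left in values:
            # left > right together with left = v implies right < v,
            # a filter that can be pushed down to right's scan.
            derived.append(('lt', right, values[left]))
    return derived

# t1.c1 = 3 and t1.c1 > t2.c1 should yield the pushable filter t2.c1 < 3:
constraints = [('eq', 't1.c1', 3), ('gt', 't1.c1', 't2.c1')]
print(infer_from_inequality(constraints))  # [('lt', 't2.c1', 3)]
```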
[jira] [Created] (SPARK-30768) Constraints should be inferred from inequality attributes
Yuming Wang created SPARK-30768: --- Summary: Constraints should be inferred from inequality attributes Key: SPARK-30768 URL: https://issues.apache.org/jira/browse/SPARK-30768 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
[jira] [Commented] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033376#comment-17033376 ] Liang Zhang commented on SPARK-30762: - Hi Hyukjin, I updated the description. > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > In the previous PR, we introduced a UDF to convert a column of MLlib Vectors > to a column of lists in python (Seq in scala). Currently, all the floating > numbers in a vector are converted to Double in scala. In this issue, we will > add a parameter to the python function {{vector_to_array(col)}} that allows > converting to Float (32 bits) in scala, which would be mapped to a numpy array > of dtype=float32. > 
[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang Zhang updated SPARK-30762: Description: Previous PR: [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] In the previous PR, we introduced a UDF to convert a column of MLlib Vectors to a column of lists in python (Seq in scala). Currently, all the floating numbers in a vector are converted to Double in scala. In this issue, we will add a parameter to the python function {{vector_to_array(col)}} that allows converting to Float (32 bits) in scala, which would be mapped to a numpy array of dtype=float32. was: Previous PR: [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > In the previous PR, we introduced a UDF to convert a column of MLlib Vectors > to a column of lists in python (Seq in scala). Currently, all the floating > numbers in a vector are converted to Double in scala. In this issue, we will > add a parameter to the python function {{vector_to_array(col)}} that allows > converting to Float (32 bits) in scala, which would be mapped to a numpy array > of dtype=float32. > 
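Independently of the Spark API, the precision tradeoff behind a dtype="float32" option can be illustrated with the standard library alone: each element round-trips through IEEE-754 binary32, which halves storage but keeps only about 7 significant decimal digits. This is an illustration of the dtype semantics, not the actual implementation:

```python
import struct

def to_float32(x):
    """Round-trip a Python float (64-bit) through IEEE-754 binary32,
    mimicking what converting an element to dtype=float32 does."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

vec = [0.1, 0.2, 0.3]
as32 = [to_float32(v) for v in vec]

# 0.1 is not exactly representable; binary32 keeps fewer bits than binary64:
print(as32[0])            # 0.10000000149011612
print(as32[0] == vec[0])  # False
print(to_float32(0.5))    # 0.5 (exactly representable, unchanged)
```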
[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033374#comment-17033374 ] Abhishek Rao commented on SPARK-30619: -- Hi [~hyukjin.kwon] I just built a container using spark-2.4.4-bin-without-hadoop.tgz. Here is the spark-submit command that I used. ./spark-submit --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount --master k8s://https:// --name spark-test --conf spark.kubernetes.container.image= --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa --conf spark.kubernetes.namespace=spark local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar file:///opt/spark/RELEASE > org.slf4j.Logger and org.apache.commons.collections classes not built as part > of hadoop-provided profile > > > Key: SPARK-30619 > URL: https://issues.apache.org/jira/browse/SPARK-30619 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.2, 2.4.4 > Environment: Spark on kubernetes >Reporter: Abhishek Rao >Priority: Major > > We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java Word count > (org.apache.spark.examples.JavaWordCount) example on local files. > But we're seeing that it expects the org.slf4j.Logger and > org.apache.commons.collections classes to be available for executing this. > We expected the binary to work as-is for local files. Is there anything > we're missing?
[jira] [Assigned] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-30762: - Assignee: Liang Zhang > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > 
[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-30762: -- Component/s: PySpark > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Assignee: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > 
[jira] [Reopened] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode
[ https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-29721: - > Spark SQL reads unnecessary nested fields after using explode > - > > Key: SPARK-29721 > URL: https://issues.apache.org/jira/browse/SPARK-29721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Kai Kang >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.0.0 > > > This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column > pruning for nested structures. However, when explode() is called on a nested > field, all columns for that nested structure are still fetched from the data > source. > We are working on a project to create a parquet store for a big pre-joined > table between two tables that have a one-to-many relationship, and this is a > blocking issue for us. > > The following code illustrates the issue. > Part 1: loading some nested data > {noformat} > val jsonStr = """{ > "items": [ >{"itemId": 1, "itemData": "a"}, >{"itemId": 2, "itemData": "b"} > ] > }""" > val df = spark.read.json(Seq(jsonStr).toDS) > df.write.format("parquet").mode("overwrite").saveAsTable("persisted") > {noformat} > > Part 2: reading it back and explaining the queries > {noformat} > val read = spark.table("persisted") > spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true) > // pruned, only loading itemId > // ReadSchema: struct>> > read.select($"items.itemId").explain(true) > // not pruned, loading both itemId and itemData > // ReadSchema: struct>> > read.select(explode($"items.itemId")).explain(true) > {noformat} > 
[jira] [Resolved] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time
[ https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30614. - Fix Version/s: 3.0.0 Assignee: Terry Kim Resolution: Fixed > The native ALTER COLUMN syntax should change one thing at a time > > > Key: SPARK-30614 > URL: https://issues.apache.org/jira/browse/SPARK-30614 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the > SQL standard. > {code} > ALTER TABLE table=multipartIdentifier > (ALTER | CHANGE) COLUMN? column=multipartIdentifier > (TYPE dataType)? > (COMMENT comment=STRING)? > colPosition? > {code} > The SQL standard (section 11.12) only allows changing one property at a time. > This is also true of other recent SQL systems like > snowflake(https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) > and > redshift(https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html) > Snowflake has an extension that allows changing multiple columns at a > time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the > SQL standard, I think this syntax is better. > For now, let's be conservative and only allow changing one property at a time.
[jira] [Resolved] (SPARK-30732) BroadcastExchangeExec does not fully honor "spark.broadcast.compress"
[ https://issues.apache.org/jira/browse/SPARK-30732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30732. -- Resolution: Won't Fix > BroadcastExchangeExec does not fully honor "spark.broadcast.compress" > - > > Key: SPARK-30732 > URL: https://issues.apache.org/jira/browse/SPARK-30732 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Puneet >Priority: Major > > Setting {{spark.broadcast.compress}} to false disables compression while > sending a broadcast variable to executors > ([https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization]) > However, this does not disable compression for any child relations sent by the > executors to the driver. > Setting spark.broadcast.compress to false should disable both sides of the > traffic, allowing users to disable compression for the whole broadcast join, > for example. > [https://github.com/puneetguptanitj/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L89] > Here, `executeCollectIterator` calls `getByteArrayRdd`, which by default > always returns a compressed stream > 
[jira] [Commented] (SPARK-30741) The data returned from SAS using JDBC reader contains column label
[ https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033350#comment-17033350 ] Hyukjin Kwon commented on SPARK-30741: -- Spark 2.1.x is EOL. Can you try it in a higher version? Also, if possible, please provide reproducible steps; otherwise, no one can verify it. > The data returned from SAS using JDBC reader contains column label > -- > > Key: SPARK-30741 > URL: https://issues.apache.org/jira/browse/SPARK-30741 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.1.1 >Reporter: Gary Liu >Priority: Major > Attachments: SparkBug.png > > > When reading SAS data using JDBC with the SAS SHARE driver, the returned data > contains column labels rather than data. > According to testing results from SAS Support, the results are correct using > Java, so they believe the issue is on the Spark reading side. 
[jira] [Updated] (SPARK-30740) months_between wrong calculation
[ https://issues.apache.org/jira/browse/SPARK-30740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30740: - Priority: Major (was: Critical) > months_between wrong calculation > > > Key: SPARK-30740 > URL: https://issues.apache.org/jira/browse/SPARK-30740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: nhufas >Priority: Major > > months_between does not calculate correctly for February. > Example: > > {{select }} > {{ months_between('2020-02-29','2019-12-29')}} > {{,months_between('2020-02-29','2019-12-30') }} > {{,months_between('2020-02-29','2019-12-31') }} > > generates a result like this > |2|1.96774194|2| > > The value for 2019-12-30 appears to be calculated wrong. > 
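The reported value appears consistent with the documented months_between semantics rather than a miscalculation: when the two days of month match, or both dates are the last day of their month, the result is a whole number of months; otherwise the leftover days are divided by 31 and the result is rounded to 8 decimal places. A standard-library sketch of those rules reproduces all three values above:

```python
import calendar
from datetime import date

def months_between(d1, d2):
    """Sketch of Spark's documented months_between(d1, d2) rules."""
    months = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    last1 = d1.day == calendar.monthrange(d1.year, d1.month)[1]
    last2 = d2.day == calendar.monthrange(d2.year, d2.month)[1]
    # Whole months when days of month match or both are month-end...
    if d1.day == d2.day or (last1 and last2):
        return float(months)
    # ...otherwise leftover days on a 31-day basis, rounded to 8 places.
    return round(months + (d1.day - d2.day) / 31.0, 8)

print(months_between(date(2020, 2, 29), date(2019, 12, 29)))  # 2.0 (same day of month)
print(months_between(date(2020, 2, 29), date(2019, 12, 30)))  # 1.96774194 = 2 - 1/31
print(months_between(date(2020, 2, 29), date(2019, 12, 31)))  # 2.0 (both last days of month)
```

So 1.96774194 is the fractional path (2 months minus 1/31) rather than a February-specific bug; whether that is the *desirable* behavior is a separate question.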
[jira] [Resolved] (SPARK-30741) The data returned from SAS using JDBC reader contains column label
[ https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30741. -- Resolution: Incomplete > The data returned from SAS using JDBC reader contains column label > -- > > Key: SPARK-30741 > URL: https://issues.apache.org/jira/browse/SPARK-30741 > Project: Spark > Issue Type: Bug > Components: Input/Output, PySpark >Affects Versions: 2.1.1 >Reporter: Gary Liu >Priority: Major > Attachments: SparkBug.png > > > When reading SAS data using JDBC with the SAS SHARE driver, the returned data > contains column labels rather than data. > According to testing results from SAS Support, the results are correct using > Java, so they believe the issue is on the Spark reading side. 
[jira] [Resolved] (SPARK-30745) Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-30745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30745. -- Resolution: Incomplete > Spark streaming, kafka broker error, "Failed to get records for > spark-executor- after polling for 512" > --- > > Key: SPARK-30745 > URL: https://issues.apache.org/jira/browse/SPARK-30745 > Project: Spark > Issue Type: Bug > Components: Build, Deploy, DStreams, Kubernetes >Affects Versions: 2.0.2 > Environment: Spark 2.0.2, Kafka 0.10 >Reporter: Harneet K >Priority: Major > Labels: Spark2.0.2, cluster, kafka-0.10, spark-streaming-kafka > > We have a spark streaming application reading data from Kafka. > Data size: 15 Million > Below errors were seen: > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor- after polling for 512 at > scala.Predef$.assert(Predef.scala:170) > There were more errors seen pertaining to CachedKafkaConsumer > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at 
org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > > The spark.streaming.kafka.consumer.poll.ms is set to the default 512 ms and the other > Kafka stream timeout settings are at their defaults. > "request.timeout.ms" > "heartbeat.interval.ms" > "session.timeout.ms" > "max.poll.interval.ms" > Also, Kafka was recently updated from 0.8 to 0.10; in 0.8, this > behavior was not seen. > No resource issues are seen. > 
[jira] [Commented] (SPARK-30745) Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-30745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033348#comment-17033348 ] Hyukjin Kwon commented on SPARK-30745: -- Spark 2.0.x is EOL. Can you try it in a higher version? > Spark streaming, kafka broker error, "Failed to get records for > spark-executor- after polling for 512" > --- > > Key: SPARK-30745 > URL: https://issues.apache.org/jira/browse/SPARK-30745 > Project: Spark > Issue Type: Bug > Components: Build, Deploy, DStreams, Kubernetes >Affects Versions: 2.0.2 > Environment: Spark 2.0.2, Kafka 0.10 >Reporter: Harneet K >Priority: Major > Labels: Spark2.0.2, cluster, kafka-0.10, spark-streaming-kafka > > We have a spark streaming application reading data from Kafka. > Data size: 15 Million > Below errors were seen: > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor- after polling for 512 at > scala.Predef$.assert(Predef.scala:170) > There were more errors seen pertaining to CachedKafkaConsumer > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> The spark.streaming.kafka.consumer.poll.ms is set to the default 512 ms, and the
> other Kafka stream timeout settings are at their defaults:
> "request.timeout.ms"
> "heartbeat.interval.ms"
> "session.timeout.ms"
> "max.poll.interval.ms"
> Also, Kafka was recently upgraded from 0.8 to 0.10; this behavior was not seen
> in 0.8.
> No resource issues are seen.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
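A common first mitigation for this symptom (illustrative, not an official fix) is to give each poll more time than the 512 ms default, so slow brokers can respond before the assertion fires. The jar name below is a placeholder:

```shell
# Illustrative only: raise the per-poll timeout well above the 512 ms default.
# Tune the value to observed broker latency under load, and keep the Kafka
# client timeouts (request.timeout.ms etc.) consistent with it.
spark-submit \
  --conf spark.streaming.kafka.consumer.poll.ms=10000 \
  your-streaming-app.jar
```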
[jira] [Commented] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033346#comment-17033346 ] Hyukjin Kwon commented on SPARK-30762: -- Sorry, can you clarify what you mean by dtype support? > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30767) from_json changes times of timestamps by several minutes without error
[ https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30767.
-- Resolution: Not A Problem
> from_json changes times of timestamps by several minutes without error
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure
> Databricks with Databricks Runtime version 6.3 within an interactive cluster.
> We encountered the issue first on a Job Cluster running a streaming
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>
> When a JSON text column includes a timestamp and the timestamp has a format
> like {{2020-01-25T06:39:45.887429Z}}, the function
> {{from_json(Column,StructType)}} is able to infer a timestamp, but that
> timestamp is shifted by several minutes.
> Spark does not throw any kind of error but continues to run with the
> invalid timestamp.
> The following Scala snippet reproduces the issue.
> > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json") > val struct = new StructType().add("time", TimestampType, nullable = true) > val timeDF = df > .withColumn("time (string)", get_json_object(col("json"), "$.time")) > .withColumn("time casted directly (CORRECT)", col("time > (string)").cast(TimestampType)) > .withColumn("time casted via struct (INVALID)", from_json(col("json"), > struct)) > display(timeDF) > {code} > Output: > ||json||time (string)||time casted directly (CORRECT)||time casted via struct > (INVALID) > |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30767) from_json changes times of timestamps by several minutes without error
[ https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30767:
- Labels: (was: corruption)
> from_json changes times of timestamps by several minutes without error
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure
> Databricks with Databricks Runtime version 6.3 within an interactive cluster.
> We encountered the issue first on a Job Cluster running a streaming
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>
> When a JSON text column includes a timestamp and the timestamp has a format
> like {{2020-01-25T06:39:45.887429Z}}, the function
> {{from_json(Column,StructType)}} is able to infer a timestamp, but that
> timestamp is shifted by several minutes.
> Spark does not throw any kind of error but continues to run with the
> invalid timestamp.
> The following Scala snippet reproduces the issue.
> > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json") > val struct = new StructType().add("time", TimestampType, nullable = true) > val timeDF = df > .withColumn("time (string)", get_json_object(col("json"), "$.time")) > .withColumn("time casted directly (CORRECT)", col("time > (string)").cast(TimestampType)) > .withColumn("time casted via struct (INVALID)", from_json(col("json"), > struct)) > display(timeDF) > {code} > Output: > ||json||time (string)||time casted directly (CORRECT)||time casted via struct > (INVALID) > |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException
[ https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30711. -- Resolution: Cannot Reproduce > 64KB JVM bytecode limit - janino.InternalCompilerException > -- > > Key: SPARK-30711 > URL: https://issues.apache.org/jira/browse/SPARK-30711 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4 > Environment: Windows 10 > Spark 2.4.4 > scalaVersion 2.11.12 > JVM Oracle 1.8.0_221-b11 >Reporter: Frederik Schreiber >Priority: Major > > Exception > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBERROR CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Code of method "processNext()V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4" > grows beyond 64 KB at > org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at > org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at > org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369) > at > org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at > org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at > org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > at > 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) > at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.collect(Dataset.scala:2783) at >
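For anyone still hitting the 64 KB limit, two Spark SQL settings control the behavior discussed in this thread; a hedged sketch (both flags are real configs, the jar name is a placeholder):

```shell
# Keep the interpreted-mode fallback enabled (default true) so a failed
# compilation degrades gracefully instead of failing the query, or disable
# whole-stage codegen entirely for the offending job.
spark-submit \
  --conf spark.sql.codegen.fallback=true \
  --conf spark.sql.codegen.wholeStage=false \
  your-app.jar
```

As a general practice (not a guaranteed fix), splitting very wide transformations into smaller stages, e.g. by checkpointing intermediate DataFrames, also keeps the generated `processNext()` methods smaller.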
[jira] [Resolved] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row
[ https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30687.
-- Resolution: Cannot Reproduce
> When reading from a file with pre-defined schema and encountering a single
> value that is not the same type as that of its column, Spark nullifies the
> entire row
> -
>
> Key: SPARK-30687
> URL: https://issues.apache.org/jira/browse/SPARK-30687
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bao Nguyen
>Priority: Major
>
> When reading from a file with a pre-defined schema and encountering a single
> value that is not the same type as that of its column, Spark nullifies the
> entire row instead of setting only the value at that cell to null.
>
> {code:java}
> case class TestModel(
>   num: Double, test: String, mac: String, value: Double
> )
> val schema =
>   ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]
> // here's the content of the file test.data
> // 1~test~mac1~2
> // 1.0~testdatarow2~mac2~non-numeric
> // 2~test1~mac1~3
> val ds = spark
>   .read
>   .schema(schema)
>   .option("delimiter", "~")
>   .csv("/test-data/test.data")
> ds.show();
> // the content of the data frame; the second row is all null.
> // +----+-----+----+-----+
> // | num| test| mac|value|
> // +----+-----+----+-----+
> // | 1.0| test|mac1|  2.0|
> // |null| null|null| null|
> // | 2.0|test1|mac1|  3.0|
> // +----+-----+----+-----+
> // should be
> // +----+------------+----+-----+
> // | num|        test| mac|value|
> // +----+------------+----+-----+
> // | 1.0|        test|mac1|  2.0|
> // | 1.0|testdatarow2|mac2| null|
> // | 2.0|       test1|mac1|  3.0|
> // +----+------------+----+-----+{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
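For contrast, here is a minimal framework-free sketch (plain Python, not Spark code) of the per-cell behavior the reporter expects: only the value that fails type coercion becomes null, not the whole row. The `schema` list mirrors the `TestModel` fields:

```python
import csv
import io

# Hypothetical per-cell coercion; schema mirrors TestModel(num, test, mac, value).
schema = [float, str, str, float]
data = "1~test~mac1~2\n1.0~testdatarow2~mac2~non-numeric\n2~test1~mac1~3\n"

def coerce(cell, typ):
    """Return the coerced value, or None if only this cell is malformed."""
    try:
        return typ(cell)
    except ValueError:
        return None

rows = [[coerce(cell, typ) for cell, typ in zip(row, schema)]
        for row in csv.reader(io.StringIO(data), delimiter="~")]
# The middle row keeps its good values and nulls only the bad one:
# [1.0, 'testdatarow2', 'mac2', None]
```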
[jira] [Updated] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
[ https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30688:
- Affects Version/s: 3.0.0
2.4.4
> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.3.2, 2.4.4, 3.0.0
>Reporter: Rajkumar Singh
>Priority: Major
>
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-------------------------+
> |unix_timestamp(20201, ww)|
> +-------------------------+
> |                     null|
> +-------------------------+
>
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> +-------------------------+
> |unix_timestamp(20202, ww)|
> +-------------------------+
> |               1578182400|
> +-------------------------+
> {code}
>
> This seems to happen only for leap years. I dug deeper into it, and it seems
> that Spark uses java.text.SimpleDateFormat to parse the expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
> t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
> but the parse fails: SimpleDateFormat cannot parse the date and throws an
> unparseable-date exception, which Spark handles silently and returns NULL.
>
> *Spark-3.0:* I ran some tests; Spark no longer uses the legacy
> java.text.SimpleDateFormat but the Java date/time API, and that API expects a
> valid date in a valid format:
> org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
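The swallow-and-return-NULL behavior described above can be sketched in plain Python (an analogy, not Spark's actual JVM code path, which goes through SimpleDateFormat):

```python
from datetime import datetime

def unix_timestamp(value, fmt):
    """Parse like Spark 2.x's unix_timestamp: swallow parse errors, return None."""
    try:
        return int(datetime.strptime(value, fmt).timestamp())
    except ValueError:
        return None  # Spark returns NULL here instead of surfacing the error

# A well-formed date parses; a string the pattern cannot handle yields None
# silently, which is why the bad week-of-year parse above goes unnoticed.
```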
[jira] [Resolved] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30619. -- Resolution: Cannot Reproduce > org.slf4j.Logger and org.apache.commons.collections classes not built as part > of hadoop-provided profile > > > Key: SPARK-30619 > URL: https://issues.apache.org/jira/browse/SPARK-30619 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.2, 2.4.4 > Environment: Spark on kubernetes >Reporter: Abhishek Rao >Priority: Major > > We're using spark-2.4.4-bin-without-hadoop.tgz and executing Java Word count > (org.apache.spark.examples.JavaWordCount) example on local files. > But we're seeing that it is expecting org.slf4j.Logger and > org.apache.commons.collections classes to be available for executing this. > We expected the binary to work as it is for local files. Is there anything > which we're missing? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile
[ https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033345#comment-17033345 ] Hyukjin Kwon commented on SPARK-30619: -- [~abhisrao], can you show the exact step you did so I can follow? I can't reproduce. > org.slf4j.Logger and org.apache.commons.collections classes not built as part > of hadoop-provided profile > > > Key: SPARK-30619 > URL: https://issues.apache.org/jira/browse/SPARK-30619 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.2, 2.4.4 > Environment: Spark on kubernetes >Reporter: Abhishek Rao >Priority: Major > > We're using spark-2.4.4-bin-without-hadoop.tgz and executing Java Word count > (org.apache.spark.examples.JavaWordCount) example on local files. > But we're seeing that it is expecting org.slf4j.Logger and > org.apache.commons.collections classes to be available for executing this. > We expected the binary to work as it is for local files. Is there anything > which we're missing? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
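For the "without-hadoop" (hadoop-provided) build, slf4j and commons-collections are expected to come from an existing Hadoop classpath rather than from the Spark tarball. The Spark "Hadoop Free" build docs describe wiring that up via SPARK_DIST_CLASSPATH; a minimal sketch, assuming `hadoop` is on PATH:

```shell
# conf/spark-env.sh -- point the hadoop-free build at a local Hadoop install,
# whose classpath supplies slf4j, commons-collections, and other deps.
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```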
[jira] [Updated] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-28067:
- Priority: Critical (was: Blocker)
> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Priority: Critical
> Labels: correctness
>
> The following test case involving a join followed by a sum aggregation
> returns the wrong answer for the sum:
>
> {code:java}
> val df = Seq(
>   (BigDecimal("1000"), 1),
>   (BigDecimal("1000"), 1),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2),
>   (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df,
>   "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |    4000.00|
> +-----------+
> {code}
>
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
> df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum"))
> df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |       null|
> +-----------+
> {code}
>
> The correct answer, 10., doesn't fit in
> the DataType picked for the result, decimal(38,18), so an overflow occurs,
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should
> also return null, indicating overflow, but it doesn't. This may mislead the
> user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned, because the overflow is caught as an
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39
> exceeds max precision 38
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
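The arithmetic behind the overflow can be checked directly. A sketch using Python's decimal module; the ~1e21 magnitude is inferred from the truncated correct answer quoted above, so treat it as approximate:

```python
from decimal import Decimal

# DECIMAL(38, 18) keeps 38 significant digits, 18 of them after the point,
# leaving 38 - 18 = 20 digits for the integral part.
PRECISION, SCALE = 38, 18
max_integral_digits = PRECISION - SCALE

correct_sum = Decimal("1.04E+21")  # assumed magnitude of the exact join-sum
integral_digits = len(str(int(correct_sum)))
# 22 integral digits > 20, so the exact sum is unrepresentable: Spark turns
# the overflow into NULL (or, with wholeStage codegen off, raises
# "Decimal precision 39 exceeds max precision 38").
```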
[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033344#comment-17033344 ] Hyukjin Kwon commented on SPARK-28067: -- I am lowering the priority given that {{spark.sql.codegen.wholeStage}} defaults to {{true}}. > Incorrect results in decimal aggregation with whole-stage code gen enabled > -- > > Key: SPARK-28067 > URL: https://issues.apache.org/jira/browse/SPARK-28067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Mark Sirek >Priority: Critical > Labels: correctness > > The following test case involving a join followed by a sum aggregation > returns the wrong answer for the sum: > > {code:java} > val df = Seq( > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 1), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2), > (BigDecimal("1000"), 2)).toDF("decNum", "intNum") > val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, > "intNum").agg(sum("decNum")) > scala> df2.show(40,false) > --- > sum(decNum) > --- > 4000.00 > --- > > {code} > > The result should be 104000.. > It appears a partial sum is computed for each join key, as the result > returned would be the answer for all rows matching intNum === 1. 
> If only the rows with intNum === 2 are included, the answer given is null: > > {code:java} > scala> val df3 = df.filter($"intNum" === lit(2)) > df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: > decimal(38,18), intNum: int] > scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, > "intNum").agg(sum("decNum")) > df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)] > scala> df4.show(40,false) > --- > sum(decNum) > --- > null > --- > > {code} > > The correct answer, 10., doesn't fit in > the DataType picked for the result, decimal(38,18), so an overflow occurs, > which Spark then converts to null. > The first example, which doesn't filter out the intNum === 1 values should > also return null, indicating overflow, but it doesn't. This may mislead the > user to think a valid sum was computed. > If whole-stage code gen is turned off: > spark.conf.set("spark.sql.codegen.wholeStage", false) > ... incorrect results are not returned because the overflow is caught as an > exception: > java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 > exceeds max precision 38 > > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26449) Missing Dataframe.transform API in Python API
[ https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033343#comment-17033343 ] Hyukjin Kwon commented on SPARK-26449:
-- To match the Scala side. It should be easy to work around.
> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Assignee: Erik Christiansen
>Priority: Minor
> Fix For: 3.0.0
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I would like to chain custom transformations, as suggested in this [blog
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55].
> This will allow writing something like the following:
>
> {code:java}
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame
> def transform(self, f):
>     return f(self)
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the python part)
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
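The pattern is framework-agnostic. A self-contained sketch, with a hypothetical `Frame` class standing in for a Spark DataFrame, shows why the one-line `transform` method enables the chaining:

```python
class Frame:
    """Tiny stand-in for a DataFrame: immutable rows, chainable methods."""
    def __init__(self, rows):
        self.rows = rows

    def with_column(self, name, value):
        # Return a new Frame rather than mutating, like DataFrame.withColumn.
        return Frame([{**row, name: value} for row in self.rows])

    def transform(self, f):
        # The whole feature: apply f to self, so user functions keep chaining.
        return f(self)

def with_greeting(df):
    return df.with_column("greeting", "hi")

out = (Frame([{"name": "jose", "age": 1}])
       .transform(with_greeting)
       .transform(lambda df: df.with_column("something", "crazy")))
```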
[jira] [Commented] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row
[ https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033298#comment-17033298 ] Maxim Gekk commented on SPARK-30687: This feature will come with Spark 3.0 https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2 . It wasn't backported to 2.4.x > When reading from a file with pre-defined schema and encountering a single > value that is not the same type as that of its column , Spark nullifies the > entire row > - > > Key: SPARK-30687 > URL: https://issues.apache.org/jira/browse/SPARK-30687 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bao Nguyen >Priority: Major > > When reading from a file with pre-defined schema and encountering a single > value that is not the same type as that of its column , Spark nullifies the > entire row instead of setting the value at that cell to be null. > > {code:java} > case class TestModel( > num: Double, test: String, mac: String, value: Double > ) > val schema = > ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType] > //here's the content of the file test.data > //1~test~mac1~2 > //1.0~testdatarow2~mac2~non-numeric > //2~test1~mac1~3 > val ds = spark > .read > .schema(schema) > .option("delimiter", "~") > .csv("/test-data/test.data") > ds.show(); > //the content of data frame. second row is all null. > // ++-++-+ > // | num| test| mac|value| > // ++-++-+ > // | 1.0| test|mac1| 2.0| > // |null| null|null| null| > // | 2.0|test1|mac1| 3.0| > // ++-++-+ > //should be > // ++--++-+ > // | num| test | mac|value| > // ++--++-+ > // | 1.0| test |mac1| 2.0 | > // |1.0 |testdatarow2 |mac2| null| > // | 2.0|test1 |mac1| 3.0 | > // ++--++-+{code} > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30767) from_json changes times of timestamps by several minutes without error
[ https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033293#comment-17033293 ] Maxim Gekk edited comment on SPARK-30767 at 2/9/20 9:08 PM:
The default timestamp pattern in the JSON datasource specifies only milliseconds, but your input strings have timestamps in microsecond precision. You can change the pattern via:
{code:scala}
from_json(col("json"), struct, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX"))
{code}
Just in case: it should work in the Spark 3.0 preview and in Spark 2.4.5.
was (Author: maxgekk): The default timestamp pattern in the JSON datasource specifies only milliseconds, but your input strings have timestamps in microsecond precision. You can change the pattern via:
{code:scala}
from_json(col("json"), struct, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX"))
{code}
> from_json changes times of timestamps by several minutes without error
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure
> Databricks with Databricks Runtime version 6.3 within an interactive cluster.
> We encountered the issue first on a Job Cluster running a streaming
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
> Labels: corruption
>
> When a JSON text column includes a timestamp and the timestamp has a format
> like {{2020-01-25T06:39:45.887429Z}}, the function
> {{from_json(Column,StructType)}} is able to infer a timestamp, but that
> timestamp is shifted by several minutes.
> Spark does not throw any kind of error but continues to run with the
> invalid timestamp.
> The following Scala snippet reproduces the issue.
> > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json") > val struct = new StructType().add("time", TimestampType, nullable = true) > val timeDF = df > .withColumn("time (string)", get_json_object(col("json"), "$.time")) > .withColumn("time casted directly (CORRECT)", col("time > (string)").cast(TimestampType)) > .withColumn("time casted via struct (INVALID)", from_json(col("json"), > struct)) > display(timeDF) > {code} > Output: > ||json||time (string)||time casted directly (CORRECT)||time casted via struct > (INVALID) > |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
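The size of the shift matches this diagnosis: reading the six-digit fraction 887429 as milliseconds adds roughly 14 minutes 47 seconds. A quick check in plain Python, reproducing the arithmetic rather than Spark's parser:

```python
from datetime import datetime, timedelta

base = datetime(2020, 1, 25, 6, 39, 45)      # the timestamp minus its fraction
frac = 887429                                 # the six digits after the dot

wrong = base + timedelta(milliseconds=frac)   # fraction misread as milliseconds
right = base + timedelta(microseconds=frac)   # what the string actually means

# wrong lands at 06:54:32.429, exactly the INVALID column in the report;
# right stays at 06:39:45.887429, the CORRECT column.
```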
[jira] [Commented] (SPARK-30767) from_json changes times of timestamps by several minutes without error
[ https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033294#comment-17033294 ] Maxim Gekk commented on SPARK-30767:
Also, 2.4.4 supports only `SSS` for second fractions. That was fixed by SPARK-29904 and SPARK-29949 in 2.4.5. Please ask Databricks support which DBR versions will have those fixes.
> from_json changes times of timestamps by several minutes without error
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
> Issue Type: Bug
> Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure
> Databricks with Databricks Runtime version 6.3 within an interactive cluster.
> We encountered the issue first on a Job Cluster running a streaming
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
> Labels: corruption
>
> When a JSON text column includes a timestamp and the timestamp has a format
> like {{2020-01-25T06:39:45.887429Z}}, the function
> {{from_json(Column,StructType)}} is able to infer a timestamp, but that
> timestamp is shifted by several minutes.
> Spark does not throw any kind of error but continues to run with the
> invalid timestamp.
> The following Scala snippet reproduces the issue.
> > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json") > val struct = new StructType().add("time", TimestampType, nullable = true) > val timeDF = df > .withColumn("time (string)", get_json_object(col("json"), "$.time")) > .withColumn("time casted directly (CORRECT)", col("time > (string)").cast(TimestampType)) > .withColumn("time casted via struct (INVALID)", from_json(col("json"), > struct)) > display(timeDF) > {code} > Output: > ||json||time (string)||time casted directly (CORRECT)||time casted via struct > (INVALID) > |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30767) from_json changes times of timestamps by several minutes without error
[ https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033293#comment-17033293 ] Maxim Gekk commented on SPARK-30767: The default timestamp pattern of the JSON datasource specifies only milliseconds, but your input strings carry microsecond precision. You can change the pattern via: {code:scala} from_json(col("json"), struct, Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")) {code} > from_json changes times of timestamps by several minutes without error > --- > > Key: SPARK-30767 > URL: https://issues.apache.org/jira/browse/SPARK-30767 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 > Environment: We ran the example code with Spark 2.4.4 via Azure > Databricks with Databricks Runtime version 6.3 within an interactive cluster. > We encountered the issue first on a Job Cluster running a streaming > application on Databricks Runtime Version 5.4. >Reporter: Benedikt Maria Beckermann >Priority: Major > Labels: corruption > > When a JSON text column includes a timestamp and the timestamp has a format > like {{2020-01-25T06:39:45.887429Z}}, the function > {{from_json(Column, StructType)}} is able to infer a timestamp, but that > timestamp is changed by several minutes. > Spark does not throw any kind of error but continues to run with the > invalidated timestamp. > The following Scala snippet reproduces the issue. 
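[Editor's note] The size of the shift in the reported output can be checked independently of Spark: a milliseconds-only pattern reads the six-digit fraction "887429" as 887,429 milliseconds (about 14 minutes 47 seconds) rather than 887,429 microseconds. A small Python sketch of that arithmetic (illustrative only, not Spark's actual parser):

```python
from datetime import datetime, timedelta

# Timestamp from the report, without its fractional part.
base = datetime(2020, 1, 25, 6, 39, 45)

# A milliseconds-only pattern (SSS) reads "887429" as 887429 ms
# (~14 min 47 s) instead of 887429 microseconds.
misparsed = base + timedelta(milliseconds=887429)
correct = base + timedelta(microseconds=887429)

print(misparsed)  # 2020-01-25 06:54:32.429000 -> the INVALID column
print(correct)    # 2020-01-25 06:39:45.887429 -> the source string
```

The misparsed value reproduces the "06:54:32.429" seen in the INVALID column of the report, which supports the pattern-mismatch explanation.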
> > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json") > val struct = new StructType().add("time", TimestampType, nullable = true) > val timeDF = df > .withColumn("time (string)", get_json_object(col("json"), "$.time")) > .withColumn("time casted directly (CORRECT)", col("time > (string)").cast(TimestampType)) > .withColumn("time casted via struct (INVALID)", from_json(col("json"), > struct)) > display(timeDF) > {code} > Output: > ||json||time (string)||time casted directly (CORRECT)||time casted via struct > (INVALID) > |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30767) from_json changes times of timestamps by several minutes without error
Benedikt Maria Beckermann created SPARK-30767: - Summary: from_json changes times of timestamps by several minutes without error Key: SPARK-30767 URL: https://issues.apache.org/jira/browse/SPARK-30767 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Environment: We ran the example code with Spark 2.4.4 via Azure Databricks with Databricks Runtime version 6.3 within an interactive cluster. We encountered the issue first on a Job Cluster running a streaming application on Databricks Runtime Version 5.4. Reporter: Benedikt Maria Beckermann When a JSON text column includes a timestamp and the timestamp has a format like {{2020-01-25T06:39:45.887429Z}}, the function {{from_json(Column, StructType)}} is able to infer a timestamp, but that timestamp is changed by several minutes. Spark does not throw any kind of error but continues to run with the invalidated timestamp. The following Scala snippet reproduces the issue. {code:scala} import org.apache.spark.sql._ import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json") val struct = new StructType().add("time", TimestampType, nullable = true) val timeDF = df .withColumn("time (string)", get_json_object(col("json"), "$.time")) .withColumn("time casted directly (CORRECT)", col("time (string)").cast(TimestampType)) .withColumn("time casted via struct (INVALID)", from_json(col("json"), struct)) display(timeDF) {code} Output: ||json||time (string)||time casted directly (CORRECT)||time casted via struct (INVALID) |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30766) Wrong truncation of old timestamps to hours and days
Maxim Gekk created SPARK-30766: -- Summary: Wrong truncation of old timestamps to hours and days Key: SPARK-30766 URL: https://issues.apache.org/jira/browse/SPARK-30766 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk The date_trunc() function incorrectly truncates old timestamps to HOUR and DAY: {code:scala} scala> import org.apache.spark.sql.functions._ import org.apache.spark.sql.functions._ scala> :paste // Entering paste mode (ctrl-D to finish) Seq("0010-01-01 01:02:03.123456").toDF() .select($"value".cast("timestamp").as("ts")) .select(date_trunc("HOUR", $"ts").cast("string")) .show(false) // Exiting paste mode, now interpreting. +------------------------------------+ |CAST(date_trunc(HOUR, ts) AS STRING)| +------------------------------------+ |0010-01-01 01:30:17 | +------------------------------------+ {code} The result must be *0010-01-01 01:00:00*. {code:scala} scala> :paste // Entering paste mode (ctrl-D to finish) Seq("0010-01-01 01:02:03.123456").toDF() .select($"value".cast("timestamp").as("ts")) .select(date_trunc("DAY", $"ts").cast("string")) .show(false) // Exiting paste mode, now interpreting. +-----------------------------------+ |CAST(date_trunc(DAY, ts) AS STRING)| +-----------------------------------+ |0010-01-01 23:30:17 | +-----------------------------------+ {code} The result must be *0010-01-01 00:00:00*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
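[Editor's note] As a reference point for the expected values: truncation to HOUR or DAY should only zero out the smaller time fields, regardless of how old the timestamp is (the stray :30:17 offsets in the output suggest a time-zone offset leaking into the computation, though the report does not state the cause). A minimal Python sketch of the expected results, using plain `datetime` rather than Spark:

```python
from datetime import datetime

# The "old" timestamp from the report: year 10.
ts = datetime(10, 1, 1, 1, 2, 3, 123456)

# HOUR truncation: zero out minutes, seconds, microseconds.
hour_trunc = ts.replace(minute=0, second=0, microsecond=0)
# DAY truncation: additionally zero out the hour.
day_trunc = hour_trunc.replace(hour=0)

print(hour_trunc.isoformat(sep=" "))  # 0010-01-01 01:00:00
print(day_trunc.isoformat(sep=" "))   # 0010-01-01 00:00:00
```

These match the *must be* values stated in the issue.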
[jira] [Created] (SPARK-30765) Refine base class abstraction code style
Xin Wu created SPARK-30765: -- Summary: Refine base class abstraction code style Key: SPARK-30765 URL: https://issues.apache.org/jira/browse/SPARK-30765 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu While doing the base operator abstraction work, I found that some code snippets are still inconsistent with the rest of the abstraction code style. Case 1: the override keyword is missing for some fields in derived classes. The compiler will not catch this if we rename those fields in the future. [https://github.com/apache/spark/pull/27368#discussion_r376694045] Case 2: inconsistent abstract class definitions. The updated style will simplify derived class definitions. [https://github.com/apache/spark/pull/27368#discussion_r375061952] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30764) Improve the readability of EXPLAIN FORMATTED style
Xin Wu created SPARK-30764: -- Summary: Improve the readability of EXPLAIN FORMATTED style Key: SPARK-30764 URL: https://issues.apache.org/jira/browse/SPARK-30764 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Xin Wu The style of EXPLAIN FORMATTED output needs to be improved. We’ve already got some observations/ideas in [https://github.com/apache/spark/pull/27368#discussion_r376694496]. TODOs: 1. Using a comma as the separator is not clear, especially since commas are also used inside the expressions. 2. Show the column counts first? For example, `Results [4]: …` 3. Currently the attribute names are automatically generated; this needs to be refined. 4. Add an arguments field in common implementations, as EXPLAIN FORMATTED did in QueryPlan ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30763) Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
[ https://issues.apache.org/jira/browse/SPARK-30763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30763: --- Description: The current implementation of regexp_extract throws an unhandled exception, shown below: SELECT regexp_extract('1a 2b 14m', '\\d+') {code:java} [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in stage 22.0 (TID 33, 192.168.1.6, executor driver): java.lang.IndexOutOfBoundsException: No group 1 [info] at java.util.regex.Matcher.group(Matcher.java:538) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) [info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) [info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) [info] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) [info] at org.apache.spark.scheduler.Task.run(Task.scala:127) [info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at 
java.lang.Thread.run(Thread.java:748) {code} I think we should handle this exception properly. was: The current implement of regexp_extract will throws a unprocessed exception show below: SELECT regexp_extract('1a 2b 14m', '\\d+') {code:java} [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in stage 22.0 (TID 33, 192.168.1.6, executor driver): java.lang.IndexOutOfBoundsException: No group 2 [info] at java.util.regex.Matcher.group(Matcher.java:538) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) [info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) [info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) [info] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) [info] at org.apache.spark.scheduler.Task.run(Task.scala:127) [info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) {code} I think should treat this exception well. 
> Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract > - > > Key: SPARK-30763 > URL: https://issues.apache.org/jira/browse/SPARK-30763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: jiaan.geng >Priority: Major > > The current implementation of regexp_extract throws an unhandled exception, > shown below: > SELECT regexp_extract('1a 2b 14m', '\\d+') > > {code:java} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in > stage 22.0 (TID 33, 192.168.1.6, executor driver): > java.lang.IndexOutOfBoundsException: No group 1 > [info] at java.util.regex.Matcher.group(Matcher.java:538) > [info] at >
[jira] [Created] (SPARK-30763) Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
jiaan.geng created SPARK-30763: -- Summary: Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract Key: SPARK-30763 URL: https://issues.apache.org/jira/browse/SPARK-30763 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: jiaan.geng The current implementation of regexp_extract throws an unhandled exception, shown below: SELECT regexp_extract('1a 2b 14m', '\\d+') {code:java} [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in stage 22.0 (TID 33, 192.168.1.6, executor driver): java.lang.IndexOutOfBoundsException: No group 2 [info] at java.util.regex.Matcher.group(Matcher.java:538) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) [info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) [info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) [info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) [info] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156) [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) [info] at org.apache.spark.scheduler.Task.run(Task.scala:127) [info] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) [info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [info] at java.lang.Thread.run(Thread.java:748) {code} I think we should handle this exception properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
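[Editor's note] The failure mode is easy to reproduce with any regex engine: the pattern \d+ defines no capturing groups, and regexp_extract's group index defaults to 1 when omitted, so the generated code asks the matcher for a group that does not exist. A Python analogue of the failure (Java's Matcher.group throws IndexOutOfBoundsException; Python's re raises IndexError):

```python
import re

# Same pattern and input as the bug report; \d+ has no capturing groups.
m = re.search(r"\d+", "1a 2b 14m")
print(m.group(0))  # prints 1 -- group 0 (the whole match) always exists

try:
    m.group(1)  # group 1 was never defined by the pattern
except IndexError as err:
    print("IndexError:", err)  # Python's analogue of "No group 1"
```

This suggests the fix is to validate the requested group index against the pattern's group count and raise a clear error, rather than letting the matcher's out-of-bounds exception escape.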
[jira] [Resolved] (SPARK-30510) Publicly document options under spark.sql.*
[ https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30510. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27459 [https://github.com/apache/spark/pull/27459] > Publicly document options under spark.sql.* > --- > > Key: SPARK-30510 > URL: https://issues.apache.org/jira/browse/SPARK-30510 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, > but it doesn't appear to be documented in [the expected > place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none > of the options under {{spark.sql.*}} that are intended for users are > documented on spark.apache.org/docs. > We should add a new documentation page for these options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30510) Publicly document options under spark.sql.*
[ https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30510: Assignee: Hyukjin Kwon > Publicly document options under spark.sql.* > --- > > Key: SPARK-30510 > URL: https://issues.apache.org/jira/browse/SPARK-30510 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Assignee: Hyukjin Kwon >Priority: Minor > > SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, > but it doesn't appear to be documented in [the expected > place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none > of the options under {{spark.sql.*}} that are intended for users are > documented on spark.apache.org/docs. > We should add a new documentation page for these options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF
[ https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033151#comment-17033151 ] Liang Zhang commented on SPARK-30762: - I'm now working on this issue. > Add dtype="float32" support to vector_to_array UDF > -- > > Key: SPARK-30762 > URL: https://issues.apache.org/jira/browse/SPARK-30762 > Project: Spark > Issue Type: Story > Components: MLlib >Affects Versions: 3.0.0 >Reporter: Liang Zhang >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Previous PR: > [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py] > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
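[Editor's note] For background on why a dtype="float32" option is useful: vector_to_array produces double-precision values, and down-casting halves the memory per element at reduced precision. A rough stdlib-only illustration of that trade-off (not the actual PySpark API, which the linked functions.py defines):

```python
from array import array

values = [0.1, 0.2, 0.3]        # e.g. values from an ML dense vector
as_f64 = array("d", values)      # C double, analogous to dtype="float64"
as_f32 = array("f", values)      # C float, analogous to dtype="float32"

# float32 uses half the bytes per element, at reduced precision.
print(as_f64.itemsize, as_f32.itemsize)  # 8 4
print(as_f32[0])  # float32 rounding of 0.1, no longer exactly 0.1
```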