[jira] [Commented] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-09 Thread Frederik Schreiber (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033421#comment-17033421
 ] 

Frederik Schreiber commented on SPARK-30711:


Hey, here is my setting:

spark.sql.codegen.fallback: true (default)

So yes, the fallback works for me, but I'm wondering why the exception is 
printed as an error message with a full stack trace if that is a regular path 
in the code. Shouldn't that be a warning instead of an error message? Can I 
ignore this error message completely?

 

How can I avoid the generated bytecode growing past 64 KB so quickly?

thanks, Frederik
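
A hedged sketch (not from the thread) of two common ways to keep a single 
generated method under the 64 KB limit; the names and numbers below are only 
illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codegen-64kb-workarounds").getOrCreate()
import spark.implicits._

// Option 1: disable whole-stage codegen so Spark does not fuse the whole stage
// into one giant processNext() method.
spark.conf.set("spark.sql.codegen.wholeStage", false)

// Option 2: split a very long chain of column expressions so that no single
// generated method has to contain all of them.
val wide = (1 to 300).foldLeft(Seq(1).toDF("c0")) { (df, i) =>
  df.withColumn(s"c$i", $"c0" + i)
}
// localCheckpoint() materializes the intermediate result and truncates the plan,
// so code generation starts fresh for whatever comes next.
val staged = wide.localCheckpoint()
staged.count()
{code}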

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> 

[jira] [Created] (SPARK-30769) insertInto() with existing column as partition key cause weird partition result

2020-02-09 Thread Woong Seok Kang (Jira)
Woong Seok Kang created SPARK-30769:
---

 Summary: insertInto() with existing column as partition key cause 
weird partition result
 Key: SPARK-30769
 URL: https://issues.apache.org/jira/browse/SPARK-30769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: EMR 5.29.0 with Spark 2.4.4
Reporter: Woong Seok Kang


{code:java}
val tableName = s"${config.service}_$saveDatabase.${config.table}_partitioned"
val writer = TableWriter.getWriter(tableDF.withColumn(config.dateColumn, 
typedLit[String](date.toString))) 

if (xsc.tableExistIn(config.service, saveDatabase, 
s"${config.table}_partitioned")) writer.insertInto(tableName)
else writer.partitionBy(config.dateColumn).saveAsTable(tableName){code}
This code checks whether the table exists in the desired path (somewhere in S3 
in this case). If the table already exists in that path, a new partition is 
inserted with the insertInto() function.

If config.dateColumn does not exist in the table schema, there is no problem 
(the new column is simply added). But if it already exists in the schema, Spark 
does not use the given column as the partition key; instead it creates hundreds 
of partitions. Below is a part of the Spark logs.
(Note that the partition column is named date_ymd, which already exists in the 
source table; its original value is a date string like '2020-01-01'.)


20/02/10 05:33:01 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=174
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=174
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=62
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=62
20/02/10 05:33:02 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=83
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=83
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=231
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=231
20/02/10 05:33:03 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=268
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=268
20/02/10 05:33:04 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=33
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=33
20/02/10 05:33:05 INFO S3NativeFileSystem2: rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=40
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=40
rename 
s3://\{my_path_at_s3}_partitioned_test/.spark-staging-e3c1c1fc-6bbe-4e77-8b7f-201cfd60d061/date_ymd=__HIVE_DEFAULT_PARTITION__
 s3://\{my_path_at_s3}_partitioned_test/date_ymd=__HIVE_DEFAULT_PARTITION__

When I use a different partition key that is not in the table schema, such as 
'stamp_date', everything goes fine. I'm not sure whether this is a Spark bug; I 
just wrote the report. (I think it is related to Hive...)
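
A hedged illustration (not part of the original report): DataFrameWriter.insertInto 
resolves columns by position rather than by name, so if the DataFrame's column 
order differs from the target table's (which keeps its partition column last), 
another column's values can silently become the partition value. A minimal 
defensive reordering might look like this:

{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: align the DataFrame with the target table's column order before
// insertInto, so the partition column lines up by position.
def insertAligned(spark: SparkSession, df: DataFrame, tableName: String): Unit = {
  val targetCols = spark.table(tableName).columns   // partition columns come last
  df.select(targetCols.map(df.col): _*)
    .write
    .insertInto(tableName)
}
{code}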



Thanks for reading!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30768) Constraints should be inferred from inequality attributes

2020-02-09 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30768:

Description: 
How to reproduce:
{code:sql}
create table SPARK_30768_1(c1 int, c2 int);
create table SPARK_30768_2(c1 int, c2 int);
{code}


*Spark SQL*:
{noformat}
spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on 
(t1.c1 > t2.c1) where t1.c1 = 3;
== Physical Plan ==
*(3) Project [c1#5, c2#6]
+- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7)
   :- *(1) Project [c1#5, c2#6]
   :  +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3))
   : +- *(1) ColumnarToRow
   :+- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: true, 
DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], 
ReadSchema: struct
   +- BroadcastExchange IdentityBroadcastMode, [id=#60]
  +- *(2) Project [c1#7]
 +- *(2) Filter isnotnull(c1#7)
+- *(2) ColumnarToRow
   +- FileScan parquet default.spark_30768_2[c1#7] Batched: true, 
DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
 PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: 
struct
{noformat}

*Hive* supports this feature:
{noformat}
hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on (t1.c1 
> t2.c1) where t1.c1 = 3;
Warning: Map Join MAPJOIN[13][bigTable=?] in task 'Stage-3:MAPRED' is a cross 
product
OK
STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
Map Reduce Local Work
  Alias -> Map Local Tables:
$hdt$_0:t1
  Fetch Operator
limit: -1
  Alias -> Map Local Operator Tree:
$hdt$_0:t1
  TableScan
alias: t1
filterExpr: (c1 = 3) (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Filter Operator
  predicate: (c1 = 3) (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Select Operator
expressions: c2 (type: int)
outputColumnNames: _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
HashTable Sink Operator
  keys:
0
1

  Stage: Stage-3
Map Reduce
  Map Operator Tree:
  TableScan
alias: t2
filterExpr: (c1 < 3) (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
Filter Operator
  predicate: (c1 < 3) (type: boolean)
  Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column 
stats: NONE
  Select Operator
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL 
Column stats: NONE
Map Join Operator
  condition map:
   Inner Join 0 to 1
  keys:
0
1
  outputColumnNames: _col1
  Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
Column stats: NONE
  Select Operator
expressions: 3 (type: int), _col1 (type: int)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
Column stats: NONE
File Output Operator
  compressed: false
  Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL 
Column stats: NONE
  table:
  input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
  serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Execution mode: vectorized
  Local Work:
Map Reduce Local Work

  Stage: Stage-0
Fetch Operator
  limit: -1
  Processor Tree:
ListSink

Time taken: 5.491 seconds, Fetched: 71 row(s)
{noformat}
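
A hedged illustration (not from the ticket) of the constraint in question: 
t1.c1 = 3 together with t1.c1 > t2.c1 implies t2.c1 < 3, which can be added by 
hand until Spark infers it:

{code:scala}
// Manually materializing the inferred predicate; table names are the ones
// created in the reproduction above.
spark.sql("""
  SELECT t1.*
  FROM SPARK_30768_1 t1
  JOIN SPARK_30768_2 t2
    ON t1.c1 > t2.c1
  WHERE t1.c1 = 3
    AND t2.c1 < 3
""").explain(true)
{code}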


> Constraints should be inferred from inequality attributes
> -
>
> Key: SPARK-30768
> URL: https://issues.apache.org/jira/browse/SPARK-30768
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: 

[jira] [Created] (SPARK-30768) Constraints should be inferred from inequality attributes

2020-02-09 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30768:
---

 Summary: Constraints should be inferred from inequality attributes
 Key: SPARK-30768
 URL: https://issues.apache.org/jira/browse/SPARK-30768
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-09 Thread Liang Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033376#comment-17033376
 ] 

Liang Zhang commented on SPARK-30762:
-

Hi Hyukjin, I updated the description.

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
> In the previous PR, we introduced a UDF to convert a column of MLlib Vectors 
> to a column of lists in Python (a Seq in Scala). Currently, all the floating-point 
> numbers in a vector are converted to Double in Scala. In this issue, we will 
> add a parameter to the Python function {{vector_to_array(col)}} that allows 
> converting to Float (32 bits) in Scala, which would be mapped to a numpy array 
> of dtype=float32.
>  
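
A hedged sketch of the proposed usage; the dtype argument and its accepted 
values are assumptions taken from the description above, shown against the 
Scala counterpart of the Python UDF:

{code:scala}
// Sketch only; assumes an active SparkSession named `spark`.
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val df = Seq(Tuple1(Vectors.dense(1.0, 2.0, 3.0))).toDF("features")

// Current behaviour: the vector becomes array<double>.
df.select(vector_to_array($"features")).printSchema()

// Proposed: an optional dtype so the result is array<float>, which maps to
// numpy dtype=float32 on the Python side.
df.select(vector_to_array($"features", "float32")).printSchema()
{code}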



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-09 Thread Liang Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang Zhang updated SPARK-30762:

Description: 
Previous PR: 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]

In the previous PR, we introduced a UDF to convert a column of MLlib Vectors to 
a column of lists in Python (a Seq in Scala). Currently, all the floating-point 
numbers in a vector are converted to Double in Scala. In this issue, we will add a 
parameter to the Python function {{vector_to_array(col)}} that allows 
converting to Float (32 bits) in Scala, which would be mapped to a numpy array 
of dtype=float32.

 

  was:
Previous PR: 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]

 

 


> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
> In the previous PR, we introduced a UDF to convert a column of MLlib Vectors 
> to a column of lists in Python (a Seq in Scala). Currently, all the floating-point 
> numbers in a vector are converted to Double in Scala. In this issue, we will 
> add a parameter to the Python function {{vector_to_array(col)}} that allows 
> converting to Float (32 bits) in Scala, which would be mapped to a numpy array 
> of dtype=float32.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-02-09 Thread Abhishek Rao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033374#comment-17033374
 ] 

Abhishek Rao commented on SPARK-30619:
--

Hi [~hyukjin.kwon]

I just built a container using spark-2.4.4-bin-without-hadoop.tgz. Here is the 
spark-submit command that I used.

./spark-submit --deploy-mode cluster --class 
org.apache.spark.examples.JavaWordCount --master k8s://https:// 
--name spark-test --conf spark.kubernetes.container.image= --conf 
spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa --conf 
spark.kubernetes.namespace=spark 
local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar 
file:///opt/spark/RELEASE

 

> org.slf4j.Logger and org.apache.commons.collections classes not built as part 
> of hadoop-provided profile
> 
>
> Key: SPARK-30619
> URL: https://issues.apache.org/jira/browse/SPARK-30619
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.2, 2.4.4
> Environment: Spark on kubernetes
>Reporter: Abhishek Rao
>Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count 
> example (org.apache.spark.examples.JavaWordCount) on local files.
> But we're seeing that it expects the org.slf4j.Logger and 
> org.apache.commons.collections classes to be available in order to run.
> We expected the binary to work as-is for local files. Is there anything 
> we're missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-09 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-30762:
-

Assignee: Liang Zhang

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-09 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-30762:
--
Component/s: PySpark

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib, PySpark
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Assignee: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29721) Spark SQL reads unnecessary nested fields after using explode

2020-02-09 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-29721:
-

> Spark SQL reads unnecessary nested fields after using explode
> -
>
> Key: SPARK-29721
> URL: https://issues.apache.org/jira/browse/SPARK-29721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Kai Kang
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> This is a follow up for SPARK-4502. SPARK-4502 correctly addressed column 
> pruning for nested structures. However, when explode() is called on a nested 
> field, all columns for that nested structure is still fetched from data 
> source.
> We are working on a project to create a parquet store for a big pre-joined 
> table between two tables that has one-to-many relationship, and this is a 
> blocking issue for us.
>  
> The following code illustrates the issue. 
> Part 1: loading some nested data
> {noformat}
> val jsonStr = """{
>  "items": [
>{"itemId": 1, "itemData": "a"},
>{"itemId": 2, "itemData": "b"}
>  ]
> }"""
> val df = spark.read.json(Seq(jsonStr).toDS)
> df.write.format("parquet").mode("overwrite").saveAsTable("persisted")
> {noformat}
>  
> Part 2: reading it back and explaining the queries
> {noformat}
> val read = spark.table("persisted")
> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", true)
> // pruned, only loading itemId
> // ReadSchema: struct>>
> read.select($"items.itemId").explain(true) 
> // not pruned, loading both itemId and itemData
> // ReadSchema: struct>>
> read.select(explode($"items.itemId")).explain(true)
> {noformat}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30614) The native ALTER COLUMN syntax should change one thing at a time

2020-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30614.
-
Fix Version/s: 3.0.0
 Assignee: Terry Kim
   Resolution: Fixed

> The native ALTER COLUMN syntax should change one thing at a time
> 
>
> Key: SPARK-30614
> URL: https://issues.apache.org/jira/browse/SPARK-30614
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> Our native ALTER COLUMN syntax is newly added in 3.0 and almost follows the 
> SQL standard.
> {code}
> ALTER TABLE table=multipartIdentifier
>   (ALTER | CHANGE) COLUMN? column=multipartIdentifier
>   (TYPE dataType)?
>   (COMMENT comment=STRING)?
>   colPosition?   
> {code}
> The SQL standard (section 11.12) only allows changing one property at a time. 
> This is also true of other recent SQL systems like 
> Snowflake (https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html)
>  and 
> Redshift (https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html).
> Snowflake has an extension that allows changing multiple columns at a 
> time, like ALTER COLUMN c1 TYPE int, c2 TYPE int. If we want to extend the 
> SQL standard, I think this syntax is better. 
> For now, let's be conservative and only allow changing one property at a time.
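
A hedged illustration of the resulting one-property-at-a-time usage; the table 
and column names are made up, and whether a given catalog accepts each change 
is a separate question:

{code:scala}
// Each statement changes exactly one property, per the restricted grammar above.
spark.sql("ALTER TABLE events ALTER COLUMN payload_size TYPE bigint")
spark.sql("ALTER TABLE events ALTER COLUMN payload_size COMMENT 'measured in bytes'")
spark.sql("ALTER TABLE events ALTER COLUMN payload_size AFTER event_time")
{code}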



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30732) BroadcastExchangeExec does not fully honor "spark.broadcast.compress"

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30732.
--
Resolution: Won't Fix

> BroadcastExchangeExec does not fully honor "spark.broadcast.compress"
> -
>
> Key: SPARK-30732
> URL: https://issues.apache.org/jira/browse/SPARK-30732
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Puneet
>Priority: Major
>
> Setting {{spark.broadcast.compress}} to false disables compression while 
> sending broadcast variable to executors 
> ([https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization])
> However this does not disable compression for any child relations sent by the 
> executors to the driver. 
> Setting spark.broadcast.compress to false should disable both sides of the 
> traffic, allowing users to disable compression for the whole broadcast join, 
> for example.
> [https://github.com/puneetguptanitj/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L89]
> Here, `executeCollectIterator` calls `getByteArrayRdd`, which by default 
> always gets a compressed stream.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033350#comment-17033350
 ] 

Hyukjin Kwon commented on SPARK-30741:
--

Spark 2.1.x is EOL. Can you try it in a higher version? Also, if possible, 
please provide reproducible steps; otherwise, no one can verify it.

> The data returned from SAS using JDBC reader contains column label
> --
>
> Key: SPARK-30741
> URL: https://issues.apache.org/jira/browse/SPARK-30741
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.1.1
>Reporter: Gary Liu
>Priority: Major
> Attachments: SparkBug.png
>
>
> When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
> contains column labels rather than data. 
> According to testing results from SAS Support, the results are correct using 
> Java, so they believe it is due to Spark's reading. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30740) months_between wrong calculation

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30740:
-
Priority: Major  (was: Critical)

> months_between wrong calculation
> 
>
> Key: SPARK-30740
> URL: https://issues.apache.org/jira/browse/SPARK-30740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: nhufas
>Priority: Major
>
> months_between is not calculating correctly for February. For example:
>  
> {{select }}
> {{ months_between('2020-02-29','2019-12-29')}}
> {{,months_between('2020-02-29','2019-12-30') }}
> {{,months_between('2020-02-29','2019-12-31') }}
>  
> will generate a result like this 
> |2|1.96774194|2|
>  
> For 2019-12-30 the calculation is wrong.
>  
>  
>  
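
A hedged DataFrame-API equivalent of the SQL above (not part of the original 
report), useful for checking the three values side by side:

{code:scala}
import org.apache.spark.sql.functions.{lit, months_between, to_date}

spark.range(1).select(
  months_between(to_date(lit("2020-02-29")), to_date(lit("2019-12-29"))).as("from_dec_29"),
  months_between(to_date(lit("2020-02-29")), to_date(lit("2019-12-30"))).as("from_dec_30"),
  months_between(to_date(lit("2020-02-29")), to_date(lit("2019-12-31"))).as("from_dec_31")
).show(false)
{code}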



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30741) The data returned from SAS using JDBC reader contains column label

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30741.
--
Resolution: Incomplete

> The data returned from SAS using JDBC reader contains column label
> --
>
> Key: SPARK-30741
> URL: https://issues.apache.org/jira/browse/SPARK-30741
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, PySpark
>Affects Versions: 2.1.1
>Reporter: Gary Liu
>Priority: Major
> Attachments: SparkBug.png
>
>
> When reading SAS data using JDBC with the SAS SHARE driver, the returned data 
> contains column labels rather than data. 
> According to testing results from SAS Support, the results are correct using 
> Java, so they believe it is due to Spark's reading. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30745) Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30745.
--
Resolution: Incomplete

> Spark streaming, kafka broker error, "Failed to get records for 
> spark-executor-  after polling for 512"
> ---
>
> Key: SPARK-30745
> URL: https://issues.apache.org/jira/browse/SPARK-30745
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy, DStreams, Kubernetes
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2, Kafka 0.10
>Reporter: Harneet K
>Priority: Major
>  Labels: Spark2.0.2, cluster, kafka-0.10, spark-streaming-kafka
>
> We have a spark streaming application reading data from Kafka.
>  Data size: 15 Million
> Below errors were seen:
>  java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-  after polling for 512 at 
> scala.Predef$.assert(Predef.scala:170)
> There were more errors seen pertaining to CachedKafkaConsumer
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>  
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) 
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) 
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>  at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>  at org.apache.spark.scheduler.Task.run(Task.scala:86)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  
> The spark.streaming.kafka.consumer.poll.ms is set to the default of 512 ms, and 
> the other Kafka stream timeout settings are at their defaults:
>  "request.timeout.ms" 
>  "heartbeat.interval.ms" 
>  "session.timeout.ms" 
>  "max.poll.interval.ms" 
> Also, Kafka was recently updated from 0.8 to 0.10; in 0.8, this 
> behavior was not seen. 
> No resource issues are seen.
>  
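
A hedged sketch of raising the poll timeout that the assertion message refers 
to; the value is only an example, not a recommendation:

{code:scala}
import org.apache.spark.SparkConf

// spark.streaming.kafka.consumer.poll.ms is the timeout behind
// "Failed to get records ... after polling for 512".
val conf = new SparkConf()
  .setAppName("kafka-poll-timeout-example")
  .set("spark.streaming.kafka.consumer.poll.ms", "10000")
{code}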



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30745) Spark streaming, kafka broker error, "Failed to get records for spark-executor- .... after polling for 512"

2020-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033348#comment-17033348
 ] 

Hyukjin Kwon commented on SPARK-30745:
--

Spark 2.0.x is EOL. Can you try it in a higher version?

> Spark streaming, kafka broker error, "Failed to get records for 
> spark-executor-  after polling for 512"
> ---
>
> Key: SPARK-30745
> URL: https://issues.apache.org/jira/browse/SPARK-30745
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy, DStreams, Kubernetes
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2, Kafka 0.10
>Reporter: Harneet K
>Priority: Major
>  Labels: Spark2.0.2, cluster, kafka-0.10, spark-streaming-kafka
>
> We have a spark streaming application reading data from Kafka.
>  Data size: 15 Million
> Below errors were seen:
>  java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-  after polling for 512 at 
> scala.Predef$.assert(Predef.scala:170)
> There were more errors seen pertaining to CachedKafkaConsumer
>  at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>  at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>  
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) 
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) 
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>  at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) 
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>  at org.apache.spark.scheduler.Task.run(Task.scala:86)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
>  
> The spark.streaming.kafka.consumer.poll.ms is set to the default of 512 ms, and 
> the other Kafka stream timeout settings are at their defaults:
>  "request.timeout.ms" 
>  "heartbeat.interval.ms" 
>  "session.timeout.ms" 
>  "max.poll.interval.ms" 
> Also, Kafka was recently updated from 0.8 to 0.10; in 0.8, this 
> behavior was not seen. 
> No resource issues are seen.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033346#comment-17033346
 ] 

Hyukjin Kwon commented on SPARK-30762:
--

Sorry, can you clarify what you mean by dtype support?

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30767) from_json changes times of timestmaps by several minutes without error

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30767.
--
Resolution: Not A Problem

> from_json changes times of timestmaps by several minutes without error 
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure 
> Databricks with Databricks Runtime version 6.3 within an interactive cluster. 
> We encountered the issue first on a Job Cluster running a streaming 
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>
> When a JSON text column includes a timestamp and the timestamp has a format 
> like {{2020-01-25T06:39:45.887429Z}}, the function 
> {{from_json(Column,StructType)}} is able to parse a timestamp, but that 
> timestamp is shifted by several minutes. 
> Spark does not throw any kind of error but continues to run with the 
> invalid timestamp. 
> The following Scala snippet reproduces the issue.
>  
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
> val struct = new StructType().add("time", TimestampType, nullable = true)
> val timeDF = df
>   .withColumn("time (string)", get_json_object(col("json"), "$.time"))
>   .withColumn("time casted directly (CORRECT)", col("time 
> (string)").cast(TimestampType))
>   .withColumn("time casted via struct (INVALID)", from_json(col("json"), 
> struct))
> display(timeDF)
> {code}
> Output: 
> ||json||time (string)||time casted directly (CORRECT)||time casted via struct 
> (INVALID)
> |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30767) from_json changes times of timestmaps by several minutes without error

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30767:
-
Labels:   (was: corruption)

> from_json changes times of timestmaps by several minutes without error 
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure 
> Databricks with Databricks Runtime version 6.3 within an interactive cluster. 
> We encountered the issue first on a Job Cluster running a streaming 
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>
> When a JSON text column includes a timestamp and the timestamp has a format 
> like {{2020-01-25T06:39:45.887429Z}}, the function 
> {{from_json(Column,StructType)}} is able to parse a timestamp, but that 
> timestamp is shifted by several minutes. 
> Spark does not throw any kind of error but continues to run with the 
> invalid timestamp. 
> The following Scala snippet reproduces the issue.
>  
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
> val struct = new StructType().add("time", TimestampType, nullable = true)
> val timeDF = df
>   .withColumn("time (string)", get_json_object(col("json"), "$.time"))
>   .withColumn("time casted directly (CORRECT)", col("time 
> (string)").cast(TimestampType))
>   .withColumn("time casted via struct (INVALID)", from_json(col("json"), 
> struct))
> display(timeDF)
> {code}
> Output: 
> ||json||time (string)||time casted directly (CORRECT)||time casted via struct 
> (INVALID)
> |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30711) 64KB JVM bytecode limit - janino.InternalCompilerException

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30711.
--
Resolution: Cannot Reproduce

> 64KB JVM bytecode limit - janino.InternalCompilerException
> --
>
> Key: SPARK-30711
> URL: https://issues.apache.org/jira/browse/SPARK-30711
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
> Environment: Windows 10
> Spark 2.4.4
> scalaVersion 2.11.12
> JVM Oracle 1.8.0_221-b11
>Reporter: Frederik Schreiber
>Priority: Major
>
> Exception
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KBorg.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage4"
>  grows beyond 64 KB at 
> org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:382) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:237) at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:465)
>  at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:207) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1290)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1372)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1369)
>  at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) at 
> org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:584)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:583)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) 
> at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) 
> at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3384)
>  at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2783) 
> at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2783) at 
> 

[jira] [Resolved] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column , Spark nullifies the entire row

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30687.
--
Resolution: Cannot Reproduce

> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column , Spark nullifies the 
> entire row
> -
>
> Key: SPARK-30687
> URL: https://issues.apache.org/jira/browse/SPARK-30687
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bao Nguyen
>Priority: Major
>
> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column , Spark nullifies the 
> entire row instead of setting the value at that cell to be null.
>  
> {code:java}
> case class TestModel(
>   num: Double, test: String, mac: String, value: Double
> )
> val schema = 
> ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]
> //here's the content of the file test.data
> //1~test~mac1~2
> //1.0~testdatarow2~mac2~non-numeric
> //2~test1~mac1~3
> val ds = spark
>   .read
>   .schema(schema)
>   .option("delimiter", "~")
>   .csv("/test-data/test.data")
> ds.show();
> //the content of data frame. second row is all null. 
> //  ++-++-+
> //  | num| test| mac|value|
> //  ++-++-+
> //  | 1.0| test|mac1|  2.0|
> //  |null| null|null| null|
> //  | 2.0|test1|mac1|  3.0|
> //  ++-++-+
> //should be
> // ++--++-+ 
> // | num| test | mac|value| 
> // ++--++-+ 
> // | 1.0| test |mac1| 2.0 | 
> // |1.0 |testdatarow2  |mac2| null| 
> // | 2.0|test1 |mac1| 3.0 | 
> // ++--++-+{code}
>  
>  
>  
>  
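
A hedged workaround sketch (not from the ticket): keep the raw malformed record 
in a corrupt-record column so the all-null rows can at least be traced back to 
their input lines. The extra column name is an assumption:

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructType}

// `schema` is the ScalaReflection-derived schema from the report above.
val schemaWithCorrupt: StructType = schema.add("_corrupt_record", StringType, nullable = true)

val parsed = spark.read
  .schema(schemaWithCorrupt)
  .option("delimiter", "~")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/test-data/test.data")

// Rows that failed type conversion keep their raw text here instead of
// vanishing into an unexplained all-null row.
parsed.where(col("_corrupt_record").isNotNull).show(false)
{code}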



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30688:
-
Affects Version/s: 3.0.0
   2.4.4

> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.4, 3.0.0
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it, and it seems 
> that Spark is using java.text.SimpleDateFormat and tries to parse the 
> expression here:
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
>  but it fails: SimpleDateFormat is unable to parse the date and throws an 
> Unparseable exception, which Spark handles silently, returning NULL.
>  
> *Spark 3.0:* I did some tests; Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API, and it seems the 
> date/time API expects a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30619.
--
Resolution: Cannot Reproduce

> org.slf4j.Logger and org.apache.commons.collections classes not built as part 
> of hadoop-provided profile
> 
>
> Key: SPARK-30619
> URL: https://issues.apache.org/jira/browse/SPARK-30619
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.2, 2.4.4
> Environment: Spark on kubernetes
>Reporter: Abhishek Rao
>Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count 
> example (org.apache.spark.examples.JavaWordCount) on local files.
> But we're seeing that it expects the org.slf4j.Logger and 
> org.apache.commons.collections classes to be available in order to run.
> We expected the binary to work as-is for local files. Is there anything 
> we're missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033345#comment-17033345
 ] 

Hyukjin Kwon commented on SPARK-30619:
--

[~abhisrao], can you show the exact steps you took so I can follow? I can't 
reproduce it.

> org.slf4j.Logger and org.apache.commons.collections classes not built as part 
> of hadoop-provided profile
> 
>
> Key: SPARK-30619
> URL: https://issues.apache.org/jira/browse/SPARK-30619
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.2, 2.4.4
> Environment: Spark on kubernetes
>Reporter: Abhishek Rao
>Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count 
> example (org.apache.spark.examples.JavaWordCount) on local files.
> But we're seeing that it expects the org.slf4j.Logger and 
> org.apache.commons.collections classes to be available in order to run.
> We expected the binary to work as-is for local files. Is there anything 
> we're missing?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28067:
-
Priority: Critical  (was: Blocker)

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Priority: Critical
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  ---
> sum(decNum)
> ---
> null
> ---
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2020-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033344#comment-17033344
 ] 

Hyukjin Kwon commented on SPARK-28067:
--

I am lowering the priority given that {{spark.sql.codegen.wholeStage}} defaults 
to {{true}}.

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Mark Sirek
>Priority: Critical
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |4000.00    |
> +-----------+
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |null       |
> +-----------+
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should 
> also return null to indicate overflow, but it doesn't. This may mislead the 
> user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38






[jira] [Commented] (SPARK-26449) Missing Dataframe.transform API in Python API

2020-02-09 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033343#comment-17033343
 ] 

Hyukjin Kwon commented on SPARK-26449:
--

This is to match the Scala side; it should be easy to work around.
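
For reference, a short sketch of the existing Scala-side {{Dataset.transform}} that the Python API mirrors (assuming a spark-shell session, so {{spark.implicits._}} is in scope):

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Dataset.transform takes a Dataset => Dataset function, which makes the same
// kind of chaining possible on the Scala side.
def withGreeting(df: DataFrame): DataFrame = df.withColumn("greeting", lit("hi"))
def withSomething(df: DataFrame, something: String): DataFrame =
  df.withColumn("something", lit(something))

val sourceDf = Seq(("jose", 1), ("li", 2), ("liz", 3)).toDF("name", "age")

sourceDf
  .transform(withGreeting)
  .transform(withSomething(_, "crazy"))
  .show()
{code}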

> Missing Dataframe.transform API in Python API
> -
>
> Key: SPARK-26449
> URL: https://issues.apache.org/jira/browse/SPARK-26449
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Hanan Shteingart
>Assignee: Erik Christiansen
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I would like to chain custom transformations, as suggested in this [blog 
> post|https://medium.com/@mrpowers/chaining-custom-pyspark-transformations-4f38a8c7ae55].
> This will allow writing something like the following:
>  
>  
> {code:java}
>  
> from pyspark.sql.functions import lit
>
> def with_greeting(df):
>     return df.withColumn("greeting", lit("hi"))
>
> def with_something(df, something):
>     return df.withColumn("something", lit(something))
>
> data = [("jose", 1), ("li", 2), ("liz", 3)]
> source_df = spark.createDataFrame(data, ["name", "age"])
> actual_df = (source_df
>     .transform(with_greeting)
>     .transform(lambda df: with_something(df, "crazy")))
> actual_df.show()
> +----+---+--------+---------+
> |name|age|greeting|something|
> +----+---+--------+---------+
> |jose|  1|      hi|    crazy|
> |  li|  2|      hi|    crazy|
> | liz|  3|      hi|    crazy|
> +----+---+--------+---------+
> {code}
> The only thing needed to accomplish this is the following simple method for 
> DataFrame:
> {code:java}
> from pyspark.sql.dataframe import DataFrame 
> def transform(self, f): 
>     return f(self)
> DataFrame.transform = transform
> {code}
> I volunteer to do the pull request if approved (at least the Python part).
>  






[jira] [Commented] (SPARK-30687) When reading from a file with pre-defined schema and encountering a single value that is not the same type as that of its column, Spark nullifies the entire row

2020-02-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033298#comment-17033298
 ] 

Maxim Gekk commented on SPARK-30687:


This feature will come with Spark 3.0 
https://github.com/apache/spark/commit/11e5f1bcd49eec8ab4225d6e68a051b5c6a21cb2 
. It wasn't backported to 2.4.x
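
As a stopgap on 2.x (this is only a sketch, not the fix referenced above), the offending input can at least be preserved by adding a corrupt-record column to the schema, so a bad field does not silently turn into an all-null row with no trace of the original line:

{code:scala}
import org.apache.spark.sql.types._

// Same columns as the reporter's TestModel, plus a column that PERMISSIVE mode
// fills with the raw text of any row it could not fully parse.
val schema = new StructType()
  .add("num", DoubleType)
  .add("test", StringType)
  .add("mac", StringType)
  .add("value", DoubleType)
  .add("_corrupt_record", StringType)

val ds = spark.read
  .schema(schema)
  .option("delimiter", "~")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .csv("/test-data/test.data")

ds.show(false)
{code}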

> When reading from a file with pre-defined schema and encountering a single 
> value that is not the same type as that of its column, Spark nullifies the 
> entire row
> -
>
> Key: SPARK-30687
> URL: https://issues.apache.org/jira/browse/SPARK-30687
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bao Nguyen
>Priority: Major
>
> When reading from a file with a pre-defined schema and encountering a single 
> value that is not of the same type as its column, Spark nullifies the 
> entire row instead of setting only that cell to null.
>  
> {code:java}
> // imports needed to run this snippet as-is
> import org.apache.spark.sql.catalyst.ScalaReflection
> import org.apache.spark.sql.types.StructType
>
> case class TestModel(
>   num: Double, test: String, mac: String, value: Double
> )
> val schema = 
> ScalaReflection.schemaFor[TestModel].dataType.asInstanceOf[StructType]
> //here's the content of the file test.data
> //1~test~mac1~2
> //1.0~testdatarow2~mac2~non-numeric
> //2~test1~mac1~3
> val ds = spark
>   .read
>   .schema(schema)
>   .option("delimiter", "~")
>   .csv("/test-data/test.data")
> ds.show();
> //the content of data frame. second row is all null. 
> //  +----+-----+----+-----+
> //  | num| test| mac|value|
> //  +----+-----+----+-----+
> //  | 1.0| test|mac1|  2.0|
> //  |null| null|null| null|
> //  | 2.0|test1|mac1|  3.0|
> //  +----+-----+----+-----+
> // should be:
> //  +----+------------+----+-----+
> //  | num|        test| mac|value|
> //  +----+------------+----+-----+
> //  | 1.0|        test|mac1|  2.0|
> //  | 1.0|testdatarow2|mac2| null|
> //  | 2.0|       test1|mac1|  3.0|
> //  +----+------------+----+-----+{code}






[jira] [Comment Edited] (SPARK-30767) from_json changes times of timestamps by several minutes without error

2020-02-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033293#comment-17033293
 ] 

Maxim Gekk edited comment on SPARK-30767 at 2/9/20 9:08 PM:


The default timestamp pattern in JSON datasource specifies only milliseconds 
but your input strings have timestamps in microsecond precision. You can change 
the pattern via:
{code:scala}
from_json(col("json"), struct, Map("timestampFormat" -> 
"-MM-dd'T'HH:mm:ss.SSXXX")
{code}
Just in case: this should work in the Spark 3.0 preview and in Spark 2.4.5.
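
A self-contained version of that suggestion (a sketch, assuming Spark 2.4.5+ or a 3.0 preview; see the follow-up comment below about the {{SSS}} limitation in 2.4.4):

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
val struct = new StructType().add("time", TimestampType, nullable = true)

// Use six fractional digits so the microseconds in the input are parsed as
// such, instead of being interpreted as a large number of milliseconds
// (which is what shifts the time by several minutes).
val parsed = df.select(
  from_json(col("json"), struct,
    Map("timestampFormat" -> "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX")).as("parsed"))

parsed.show(false)
{code}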


was (Author: maxgekk):
The default timestamp pattern in JSON datasource specifies only milliseconds 
but your input strings have timestamps in microsecond precision. You can change 
the pattern via:
{code:scala}
from_json(col("json"), struct, Map("timestampFormat" -> 
"-MM-dd'T'HH:mm:ss.SSXXX")
{code}

> from_json changes times of timestamps by several minutes without error 
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure 
> Databricks with Databricks Runtime version 6.3 within an interactive cluster. 
> We encountered the issue first on a Job Cluster running a streaming 
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>  Labels: corruption
>
> When a json text column includes a timestamp and the timestamp has a format 
> like {{2020-01-25T06:39:45.887429Z}}, the function 
> {{from_json(Column,StructType)}} is able to infer a timestamp but that 
> timestamp is changed by several minutes. 
> Spark does not throw any kind of error but continues to run with the 
> invalidated timestamp. 
> The following Scala snippet reproduces the issue.
>  
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
> val struct = new StructType().add("time", TimestampType, nullable = true)
> val timeDF = df
>   .withColumn("time (string)", get_json_object(col("json"), "$.time"))
>   .withColumn("time casted directly (CORRECT)", col("time 
> (string)").cast(TimestampType))
>   .withColumn("time casted via struct (INVALID)", from_json(col("json"), 
> struct))
> display(timeDF)
> {code}
> Output: 
> ||json||time (string)||time casted directly (CORRECT)||time casted via struct (INVALID)||
> |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"}|






[jira] [Commented] (SPARK-30767) from_json changes times of timestamps by several minutes without error

2020-02-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033294#comment-17033294
 ] 

Maxim Gekk commented on SPARK-30767:


Also, 2.4.4 supports only `SSS` for fractions of a second. That was fixed by 
SPARK-29904 and SPARK-29949 in 2.4.5. Please ask Databricks support which DBR 
versions will include those fixes.
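
Until then, one possible 2.4.4-era workaround (a sketch, building on the reporter's own observation that the direct string cast parses the value correctly) is to extract the field as a string and cast it, instead of relying on {{from_json}}'s {{timestampFormat}}:

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")

// get_json_object pulls the raw string out of the JSON; the cast to
// TimestampType handles the microsecond precision, as shown in the
// reporter's "CORRECT" column.
val parsed = df.withColumn(
  "time",
  get_json_object(col("json"), "$.time").cast(TimestampType))

parsed.show(false)
{code}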

> from_json changes times of timestamps by several minutes without error 
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure 
> Databricks with Databricks Runtime version 6.3 within an interactive cluster. 
> We encountered the issue first on a Job Cluster running a streaming 
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>  Labels: corruption
>
> When a json text column includes a timestamp and the timestamp has a format 
> like {{2020-01-25T06:39:45.887429Z}}, the function 
> {{from_json(Column,StructType)}} is able to infer a timestamp but that 
> timestamp is changed by several minutes. 
> Spark does not throw any kind of error but continues to run with the 
> invalidated timestamp. 
> The following Scala snippet reproduces the issue.
>  
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
> val struct = new StructType().add("time", TimestampType, nullable = true)
> val timeDF = df
>   .withColumn("time (string)", get_json_object(col("json"), "$.time"))
>   .withColumn("time casted directly (CORRECT)", col("time 
> (string)").cast(TimestampType))
>   .withColumn("time casted via struct (INVALID)", from_json(col("json"), 
> struct))
> display(timeDF)
> {code}
> Output: 
> ||json||time (string)||time casted directly (CORRECT)||time casted via struct (INVALID)||
> |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"}|






[jira] [Commented] (SPARK-30767) from_json changes times of timestamps by several minutes without error

2020-02-09 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033293#comment-17033293
 ] 

Maxim Gekk commented on SPARK-30767:


The default timestamp pattern in JSON datasource specifies only milliseconds 
but your input strings have timestamps in microsecond precision. You can change 
the pattern via:
{code:scala}
from_json(col("json"), struct, Map("timestampFormat" -> 
"-MM-dd'T'HH:mm:ss.SSXXX")
{code}

> from_json changes times of timestamps by several minutes without error 
> ---
>
> Key: SPARK-30767
> URL: https://issues.apache.org/jira/browse/SPARK-30767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
> Environment: We ran the example code with Spark 2.4.4 via Azure 
> Databricks with Databricks Runtime version 6.3 within an interactive cluster. 
> We encountered the issue first on a Job Cluster running a streaming 
> application on Databricks Runtime Version 5.4.
>Reporter: Benedikt Maria Beckermann
>Priority: Major
>  Labels: corruption
>
> When a json text column includes a timestamp and the timestamp has a format 
> like {{2020-01-25T06:39:45.887429Z}}, the function 
> {{from_json(Column,StructType)}} is able to infer a timestamp but that 
> timestamp is changed by several minutes. 
> Spark does not throw any kind of error but continues to run with the 
> invalidated timestamp. 
> The following Scala snippet reproduces the issue.
>  
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")
> val struct = new StructType().add("time", TimestampType, nullable = true)
> val timeDF = df
>   .withColumn("time (string)", get_json_object(col("json"), "$.time"))
>   .withColumn("time casted directly (CORRECT)", col("time 
> (string)").cast(TimestampType))
>   .withColumn("time casted via struct (INVALID)", from_json(col("json"), 
> struct))
> display(timeDF)
> {code}
> Output: 
> ||json||time (string)||time casted directly (CORRECT)||time casted via struct (INVALID)||
> |{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"}|






[jira] [Created] (SPARK-30767) from_json changes times of timestamps by several minutes without error

2020-02-09 Thread Benedikt Maria Beckermann (Jira)
Benedikt Maria Beckermann created SPARK-30767:
-

 Summary: from_json changes times of timestamps by several minutes 
without error 
 Key: SPARK-30767
 URL: https://issues.apache.org/jira/browse/SPARK-30767
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
 Environment: We ran the example code with Spark 2.4.4 via Azure 
Databricks with Databricks Runtime version 6.3 within an interactive cluster. 
We encountered the issue first on a Job Cluster running a streaming application 
on Databricks Runtime Version 5.4.
Reporter: Benedikt Maria Beckermann


When a json text column includes a timestamp and the timestamp has a format 
like {{2020-01-25T06:39:45.887429Z}}, the function 
{{from_json(Column,StructType)}} is able to infer a timestamp but that 
timestamp is changed by several minutes. 
Spark does not throw any kind of error but continues to run with the 
invalidated timestamp. 

The following Scala snippet reproduces the issue.
 
{code:scala}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = Seq("""{"time":"2020-01-25T06:39:45.887429Z"}""").toDF("json")

val struct = new StructType().add("time", TimestampType, nullable = true)

val timeDF = df
  .withColumn("time (string)", get_json_object(col("json"), "$.time"))
  .withColumn("time casted directly (CORRECT)", col("time 
(string)").cast(TimestampType))
  .withColumn("time casted via struct (INVALID)", from_json(col("json"), 
struct))

display(timeDF)
{code}


Output: 
||json||time (string)||time casted directly (CORRECT)||time casted via struct (INVALID)||
|{"time":"2020-01-25T06:39:45.887429Z"}|2020-01-25T06:39:45.887429Z|2020-01-25T06:39:45.887+|{"time":"2020-01-25T06:54:32.429+"}|







[jira] [Created] (SPARK-30766) Wrong truncation of old timestamps to hours and days

2020-02-09 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30766:
--

 Summary: Wrong truncation of old timestamps to hours and days
 Key: SPARK-30766
 URL: https://issues.apache.org/jira/browse/SPARK-30766
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


The date_trunc() function incorrectly truncates old timestamps to HOUR and DAY:
{code:scala}
scala> import  org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq("0010-01-01 01:02:03.123456").toDF()
.select($"value".cast("timestamp").as("ts"))
.select(date_trunc("HOUR", $"ts").cast("string"))
.show(false)

// Exiting paste mode, now interpreting.

+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:30:17                 |
+------------------------------------+
{code}
the result must be *0010-01-01 01:00:00*
{code:scala}
scala> :paste
// Entering paste mode (ctrl-D to finish)

Seq("0010-01-01 01:02:03.123456").toDF()
.select($"value".cast("timestamp").as("ts"))
.select(date_trunc("DAY", $"ts").cast("string"))
.show(false)

// Exiting paste mode, now interpreting.

+-----------------------------------+
|CAST(date_trunc(DAY, ts) AS STRING)|
+-----------------------------------+
|0010-01-01 23:30:17                |
+-----------------------------------+
{code}
the result must be *0010-01-01 00:00:00*
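
For reference, a quick cross-check of those expected values with plain java.time, independent of Spark:

{code:scala}
import java.time.LocalDateTime
import java.time.temporal.ChronoUnit

// Truncating 0010-01-01 01:02:03.123456 with java.time gives the values that
// date_trunc is expected to return.
val ts = LocalDateTime.of(10, 1, 1, 1, 2, 3, 123456000)
println(ts.truncatedTo(ChronoUnit.HOURS)) // 0010-01-01T01:00
println(ts.truncatedTo(ChronoUnit.DAYS))  // 0010-01-01T00:00
{code}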






[jira] [Created] (SPARK-30765) Refine base class abstraction code style

2020-02-09 Thread Xin Wu (Jira)
Xin Wu created SPARK-30765:
--

 Summary: Refine base class abstraction code style
 Key: SPARK-30765
 URL: https://issues.apache.org/jira/browse/SPARK-30765
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


While doing the base operator abstraction work, I found that some code 
snippets are still inconsistent with the rest of the abstraction code style.
 
Case 1: the override keyword is missing for some fields in derived classes. The 
compiler will not catch it if we rename those fields in the future.
[https://github.com/apache/spark/pull/27368#discussion_r376694045]
 
 
Case 2: inconsistent abstract class definitions. The updated style will simplify 
derived class definitions.
[https://github.com/apache/spark/pull/27368#discussion_r375061952]
 






[jira] [Created] (SPARK-30764) Improve the readability of EXPLAIN FORMATTED style

2020-02-09 Thread Xin Wu (Jira)
Xin Wu created SPARK-30764:
--

 Summary: Improve the readability of EXPLAIN FORMATTED style
 Key: SPARK-30764
 URL: https://issues.apache.org/jira/browse/SPARK-30764
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


The style of EXPLAIN FORMATTED output needs to be improved. We’ve already got 
some observations/ideas in
[https://github.com/apache/spark/pull/27368#discussion_r376694496]. 
 
TODOs:
1. Using a comma as the separator is not clear, especially since commas are also 
used inside the expressions.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated; this needs to be 
refined.
4. Add an arguments field in common implementations, as EXPLAIN FORMATTED did in 
QueryPlan.
...






[jira] [Updated] (SPARK-30763) Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract

2020-02-09 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30763:
---
Description: 
The current implementation of regexp_extract throws an unhandled exception, as 
shown below:

SELECT regexp_extract('1a 2b 14m', '\\d+')

 
{code:java}
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in 
stage 22.0 (TID 33, 192.168.1.6, executor driver): 
java.lang.IndexOutOfBoundsException: No group 1
[info] at java.util.regex.Matcher.group(Matcher.java:538)
[info] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
[info] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
[info] at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:127)
[info] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
[info] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
{code}
 

I think this exception should be handled properly.

  was:
The current implementation of regexp_extract throws an unhandled exception, as 
shown below:

SELECT regexp_extract('1a 2b 14m', '\\d+')

 
{code:java}
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in 
stage 22.0 (TID 33, 192.168.1.6, executor driver): 
java.lang.IndexOutOfBoundsException: No group 2
[info] at java.util.regex.Matcher.group(Matcher.java:538)
[info] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
[info] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
[info] at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:127)
[info] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
[info] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
{code}
 

I think this exception should be handled properly.


> Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
> -
>
> Key: SPARK-30763
> URL: https://issues.apache.org/jira/browse/SPARK-30763
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of regexp_extract throws an unhandled exception, as 
> shown below:
> SELECT regexp_extract('1a 2b 14m', '\\d+')
>  
> {code:java}
> [info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in 
> stage 22.0 (TID 33, 192.168.1.6, executor driver): 
> java.lang.IndexOutOfBoundsException: No group 1
> [info] at java.util.regex.Matcher.group(Matcher.java:538)
> [info] at 
> 

[jira] [Created] (SPARK-30763) Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract

2020-02-09 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30763:
--

 Summary: Fix java.lang.IndexOutOfBoundsException No group 1 for 
regexp_extract
 Key: SPARK-30763
 URL: https://issues.apache.org/jira/browse/SPARK-30763
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: jiaan.geng


The current implementation of regexp_extract throws an unhandled exception, as 
shown below:

SELECT regexp_extract('1a 2b 14m', '\\d+')

 
{code:java}
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in 
stage 22.0 (TID 33, 192.168.1.6, executor driver): 
java.lang.IndexOutOfBoundsException: No group 2
[info] at java.util.regex.Matcher.group(Matcher.java:538)
[info] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
[info] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
[info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
[info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
[info] at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156)
[info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
[info] at org.apache.spark.scheduler.Task.run(Task.scala:127)
[info] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
[info] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
[info] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info] at java.lang.Thread.run(Thread.java:748)
{code}
 

I think this exception should be handled properly.
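
In the meantime, the error can be avoided by being explicit about the group index; a small sketch (assuming a spark-shell session), where the goal is just to pull out the first run of digits:

{code:scala}
import org.apache.spark.sql.functions._

val df = Seq("1a 2b 14m").toDF("s")

// Option 1: ask for group 0 (the whole match), since '\\d+' defines no
// capturing group.
df.select(regexp_extract(col("s"), "\\d+", 0)).show()

// Option 2: wrap the pattern in an explicit capturing group and extract
// group 1.
df.select(regexp_extract(col("s"), "(\\d+)", 1)).show()
{code}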






[jira] [Resolved] (SPARK-30510) Publicly document options under spark.sql.*

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30510.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27459
[https://github.com/apache/spark/pull/27459]

> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none 
> of the options under {{spark.sql.*}} that are intended for users are 
> documented on spark.apache.org/docs.
> We should add a new documentation page for these options.






[jira] [Assigned] (SPARK-30510) Publicly document options under spark.sql.*

2020-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30510:


Assignee: Hyukjin Kwon

> Publicly document options under spark.sql.*
> ---
>
> Key: SPARK-30510
> URL: https://issues.apache.org/jira/browse/SPARK-30510
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> SPARK-20236 added a new option, {{spark.sql.sources.partitionOverwriteMode}}, 
> but it doesn't appear to be documented in [the expected 
> place|http://spark.apache.org/docs/2.4.4/configuration.html]. In fact, none 
> of the options under {{spark.sql.*}} that are intended for users are 
> documented on spark.apache.org/docs.
> We should add a new documentation page for these options.






[jira] [Commented] (SPARK-30762) Add dtype="float32" support to vector_to_array UDF

2020-02-09 Thread Liang Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033151#comment-17033151
 ] 

Liang Zhang commented on SPARK-30762:
-

I'm now working on this issue.

> Add dtype="float32" support to vector_to_array UDF
> --
>
> Key: SPARK-30762
> URL: https://issues.apache.org/jira/browse/SPARK-30762
> Project: Spark
>  Issue Type: Story
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Liang Zhang
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Previous PR: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/functions.py]
>  
>  


