[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page

2020-04-12 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081690#comment-17081690
 ] 

Kousuke Saruta commented on SPARK-31420:


This issue probably affects all versions that use vis.js (i.e., 1.4.0 and later).

> Infinite timeline redraw in job details page
> 
>
> Key: SPARK-31420
> URL: https://issues.apache.org/jira/browse/SPARK-31420
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Kousuke Saruta
>Priority: Major
> Attachments: timeline.mov
>
>
> In the job page, the timeline section keeps changing the position style and 
> shaking. We can see that there is a warning "infinite loop in redraw" from 
> the console, which can be related to 
> https://github.com/visjs/vis-timeline/issues/17
> I am using the history server with the events under 
> "core/src/test/resources/spark-events" to reproduce.
> I have also uploaded a screen recording.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2020-04-12 Thread Sathyaprakash Govindasamy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081694#comment-17081694
 ] 

Sathyaprakash Govindasamy commented on SPARK-29854:
---

In Spark 3.0, you can change this behaviour with the configuration 
*spark.sql.ansi.enabled* ( _When `spark.sql.ansi.enabled` is set to `true`, 
Spark SQL follows the standard in basic behaviours (e.g., arithmetic 
operations, type conversion, SQL functions and SQL parsing)_ ).

If spark.sql.ansi.enabled=false, which is the default, Spark does not throw 
arithmetic errors such as an overflow exception. For the given SQL query, it 
essentially executes the statement below to cast the input value to int. Since 
the returned value is negative, passing a negative value to lpad/rpad produces 
an empty string as the output for the SQL query mentioned in this issue.
{code:java}
scala> BigDecimal("5000").longValue.toInt
res7: Int = -796917760{code}
If you set spark.sql.ansi.enabled=true, it throws the error below.
{code:java}
java.lang.ArithmeticException: Casting 5000 to int causes 
overflow
  at org.apache.spark.sql.types.Decimal.overflowException(Decimal.scala:254)
  at org.apache.spark.sql.types.Decimal.roundToInt(Decimal.scala:317)
  at org.apache.spark.sql.types.DecimalExactNumeric$.toInt(numerics.scala:183)
  at org.apache.spark.sql.types.DecimalExactNumeric$.toInt(numerics.scala:182)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$13(Cast.scala:518)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToInt$13$adapted(Cast.scala:518)
  at 
org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:808)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:686)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457)
  at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52){code}
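
For reference, a minimal spark-shell sketch (assuming Spark 3.0; the string and 
length literals are illustrative, not the exact ones from the original report) 
of toggling the flag and observing the stricter behaviour:
{code:java}
// Hedged sketch: with ANSI mode on, the overflowing cast of lpad's second
// argument should raise an ArithmeticException instead of silently wrapping.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT lpad('hi', 5000000000000000000000, ' ')").show()
{code}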

> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark Returns Empty String)
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> Hive:
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL
> function lpad(unknown, numeric, unknown) does not exist
>  
> Expected output:
> In Spark also it should throw Exception like Hive
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31425) UnsafeKVExternalSorter should also respect UnsafeAlignedOffset

2020-04-12 Thread wuyi (Jira)
wuyi created SPARK-31425:


 Summary: UnsafeKVExternalSorter should also respect 
UnsafeAlignedOffset
 Key: SPARK-31425
 URL: https://issues.apache.org/jira/browse/SPARK-31425
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: wuyi


Since BytesToBytesMap respects UnsafeAlignedOffset when writing a record, 
UnsafeKVExternalSorter should also respect UnsafeAlignedOffset when reading the 
record; otherwise it can cause a data correctness issue.
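
A rough sketch of the read-side idea (not the actual patch; `baseObject` and 
`recordOffset` are placeholders for wherever the sorter's record lives):
{code:java}
import org.apache.spark.unsafe.UnsafeAlignedOffset

// Hedged sketch: the reader must use the same length-prefix width (4 or 8 bytes)
// that BytesToBytesMap used when writing the record.
def valueOffsetOf(baseObject: AnyRef, recordOffset: Long): Long = {
  val uaoSize = UnsafeAlignedOffset.getUaoSize                        // prefix width in bytes
  val keyLen = UnsafeAlignedOffset.getSize(baseObject, recordOffset)  // key length prefix
  recordOffset + uaoSize + keyLen                                     // instead of hard-coding a 4-byte prefix
}
{code}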



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31426) Regression in loading/saving timestamps from/to ORC files

2020-04-12 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31426:
--

 Summary: Regression in loading/saving timestamps from/to ORC files
 Key: SPARK-31426
 URL: https://issues.apache.org/jira/browse/SPARK-31426
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Here are results of DateTimeRebaseBenchmark on the current master branch:
{code}
Save timestamps to ORC:           Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------
after 1582                                59877         59877          0        1.7        598.8      0.0X
before 1582                               61361         61361          0        1.6        613.6      0.0X

Load timestamps from ORC:         Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------
after 1582, vec off                       48197         48288        118        2.1        482.0      1.0X
after 1582, vec on                        38247         38351        128        2.6        382.5      1.3X
before 1582, vec off                      53179         53359        249        1.9        531.8      0.9X
before 1582, vec on                       44076         44268        269        2.3        440.8      1.1X
{code}

The results of the same benchmark on Spark 2.4.6-SNAPSHOT:
{code}
Save timestamps to ORC:           Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------
after 1582                                18858         18858          0        5.3        188.6      1.0X
before 1582                               18508         18508          0        5.4        185.1      1.0X

Load timestamps from ORC:         Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
-----------------------------------------------------------------------------------------------------------
after 1582, vec off                       14063         14177        143        7.1        140.6      1.0X
after 1582, vec on                         5955          6029        100       16.8         59.5      2.4X
before 1582, vec off                      14119         14126          7        7.1        141.2      1.0X
before 1582, vec on                        5991          6007         25       16.7         59.9      2.3X
{code}
 Here is the PR with DateTimeRebaseBenchmark backported to 2.4: 
https://github.com/MaxGekk/spark/pull/27
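
For context, a minimal sketch of the kind of ORC timestamp round trip the 
benchmark exercises (illustrative only; the path and row count are placeholders, 
assuming a running SparkSession `spark`):
{code:java}
// Hedged sketch: write timestamps to ORC and read them back, the operation whose
// throughput regressed. CAST(id AS TIMESTAMP) treats the long as seconds since epoch.
val df = spark.range(0, 1000000).selectExpr("CAST(id AS TIMESTAMP) AS ts")
df.write.mode("overwrite").orc("/tmp/ts_orc")
spark.read.orc("/tmp/ts_orc").selectExpr("max(ts)").show()
{code}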



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31348) Document Join in SQL Reference

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-31348:


Assignee: Huaxin Gao

> Document Join in SQL Reference
> --
>
> Key: SPARK-31348
> URL: https://issues.apache.org/jira/browse/SPARK-31348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Document Join in SQL Reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31348) Document Join in SQL Reference

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31348.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28121
[https://github.com/apache/spark/pull/28121]

> Document Join in SQL Reference
> --
>
> Key: SPARK-31348
> URL: https://issues.apache.org/jira/browse/SPARK-31348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Document Join in SQL Reference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29002) Avoid changing SMJ to BHJ if the build side has a high ratio of empty partitions

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29002:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Avoid changing SMJ to BHJ if the build side has a high ratio of empty 
> partitions
> 
>
> Key: SPARK-29002
> URL: https://issues.apache.org/jira/browse/SPARK-29002
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29060) Add tree traversal helper for adaptive spark plans

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29060:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Add tree traversal helper for adaptive spark plans
> --
>
> Key: SPARK-29060
> URL: https://issues.apache.org/jira/browse/SPARK-29060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9853) Optimize shuffle fetch of contiguous partition IDs

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-9853:
---
Parent Issue: SPARK-31412  (was: SPARK-9850)

> Optimize shuffle fetch of contiguous partition IDs
> --
>
> Key: SPARK-9853
> URL: https://issues.apache.org/jira/browse/SPARK-9853
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Reporter: Matei Alexandru Zaharia
>Assignee: Yuanjian Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> On the map side, we should be able to serve a block representing multiple 
> partition IDs in one block manager request



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29759) LocalShuffleReaderExec.outputPartitioning should use the corrected attributes

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29759:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> LocalShuffleReaderExec.outputPartitioning should use the corrected attributes
> -
>
> Key: SPARK-29759
> URL: https://issues.apache.org/jira/browse/SPARK-29759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31427) Spark Structure streaming read data twice per every micro-batch.

2020-04-12 Thread Nick Hryhoriev (Jira)
Nick Hryhoriev created SPARK-31427:
--

 Summary: Spark Structure streaming read data twice per every 
micro-batch.
 Key: SPARK-31427
 URL: https://issues.apache.org/jira/browse/SPARK-31427
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Nick Hryhoriev


I have a very strange issue with Spark Structured Streaming. Spark Structured 
Streaming creates two Spark jobs for every micro-batch and, as a result, reads 
data from Kafka twice. Here is a simple code snippet.

 
{code:java}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

import scala.concurrent.duration._

object CheckHowSparkReadFromKafka {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .config(new SparkConf()
        .setAppName(s"simple read from kafka with repartition")
        .setMaster("local[*]")
        .set("spark.driver.host", "localhost"))
      .getOrCreate()
    val testPath = "/tmp/spark-test"
    FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new Path(testPath), true)
    import session.implicits._
    val stream = session
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-20002-prod:9092")
      .option("subscribe", "topic")
      .option("maxOffsetsPerTrigger", 1000)
      .option("failOnDataLoss", false)
      .option("startingOffsets", "latest")
      .load()
      .repartitionByRange($"offset")
      .writeStream
      .option("path", testPath + "/data")
      .option("checkpointLocation", testPath + "/checkpoint")
      .format("parquet")
      .trigger(Trigger.ProcessingTime(10.seconds))
      .start()
    stream.processAllAvailable()
  }
}
{code}
This happens because of {{.repartitionByRange($"offset")}}: if I remove this 
line, all is fine. With it, Spark creates two jobs: one with a single stage that 
just reads from Kafka, and a second with 3 stages (read -> shuffle -> write). So 
the result of the first job is never used.

This has a significant impact on performance. Some of my Kafka topics have 1550 
partitions, so reading them twice is a big deal. If I add a cache, things get 
better, but that is not an option for me. In local mode, the first job in each 
batch takes less than 0.1 ms, except for the batch with index 0. But on YARN and 
Mesos clusters, both jobs fully execute, and on my topics each takes nearly 1.2 min.
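
One way to confirm how many jobs each micro-batch triggers is sketched below 
(illustrative only, reusing the `session` from the snippet above; not part of 
the original report):
{code:java}
// Hedged sketch: count job starts with a SparkListener while the query runs,
// to verify that each micro-batch launches two jobs.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import java.util.concurrent.atomic.AtomicInteger

val jobCount = new AtomicInteger(0)
session.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    println(s"job ${jobStart.jobId} started, total so far: ${jobCount.incrementAndGet()}")
  }
})
{code}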

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29893:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Improve the local reader performance by changing the task number from 1 to 
> multi
> 
>
> Key: SPARK-29893
> URL: https://issues.apache.org/jira/browse/SPARK-29893
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> The current local reader reads all the partitions of the map stage using only 
> 1 task, which may cause performance degradation. This PR improves the 
> performance by using multiple tasks instead of one.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29906) Reading of csv file fails with adaptive execution turned on

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-29906:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Reading of csv file fails with adaptive execution turned on
> ---
>
> Key: SPARK-29906
> URL: https://issues.apache.org/jira/browse/SPARK-29906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: build from master today nov 14
> commit fca0a6c394990b86304a8f9a64bf4c7ec58abbd6 (HEAD -> master, 
> upstream/master, upstream/HEAD)
> Author: Kevin Yu 
> Date:   Thu Nov 14 14:58:32 2019 -0600
> build using:
> $ dev/make-distribution.sh --tgz -Phadoop-2.7 -Dhadoop.version=2.7.4 -Pyarn
> deployed on AWS EMR 5.28 with 10 m5.xlarge slaves 
> in spark-env.sh:
> HADOOP_CONF_DIR=/etc/hadoop/conf
> in spark-defaults.conf:
> spark.master yarn
> spark.submit.deployMode client
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.hadoop.yarn.timeline-service.enabled false
> spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.driver.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
> spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/hadoop-lzo.jar
> spark.executor.extraLibraryPath 
> /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
>Reporter: koert kuipers
>Assignee: Wenchen Fan
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.0
>
>
> we observed an issue where spark seems to confuse a data line (not the first 
> line of the csv file) for the csv header when it creates the schema.
> {code}
> $ wget http://download.cms.gov/openpayments/PGYR13_P062819.ZIP
> $ unzip PGYR13_P062819.ZIP
> $ hadoop fs -put OP_DTL_GNRL_PGYR2013_P06282019.csv
> $ spark-3.0.0-SNAPSHOT-bin-2.7.4/bin/spark-shell --conf 
> spark.sql.adaptive.enabled=true --num-executors 10
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/11/15 00:26:47 WARN yarn.Client: Neither spark.yarn.jars nor 
> spark.yarn.archive is set, falling back to uploading libraries under 
> SPARK_HOME.
> Spark context Web UI available at http://ip-xx-xxx-x-xxx.ec2.internal:4040
> Spark context available as 'sc' (master = yarn, app id = 
> application_1573772077642_0006).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_222)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> spark.read.format("csv").option("header", 
> true).option("enforceSchema", 
> false).load("OP_DTL_GNRL_PGYR2013_P06282019.csv").show(1)
> 19/11/15 00:27:10 WARN util.package: Truncated the string representation of a 
> plan since it was too large. This behavior can be adjusted by setting 
> 'spark.sql.debug.maxToStringFields'.
> [Stage 2:>(0 + 10) / 
> 17]19/11/15 00:27:11 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 
> 2.0 (TID 35, ip-xx-xxx-x-xxx.ec2.internal, executor 1): 
> java.lang.IllegalArgumentException: CSV header does not conform to the schema.
>  Header: Change_Type, Covered_Recipient_Type, Teaching_Hospital_CCN, 
> Teaching_Hospital_ID, Teaching_Hospital_Name, Physician_Profile_ID, 
> Physician_First_Name, Physician_Middle_Name, Physician_Last_Name, 
> Physician_Name_Suffix, Recipient_Primary_Business_Street_Address_Line1, 
> Recipient_Primary_Business_Street_Address_Line2, Recipient_City, 
> Recipient_State, Recipient_Zip_Code, Recipient_Country, Recipient_Province, 
> Recipient_Postal_Code, Physician_Primary_Type, Physician_Specialty, 
> Physician_License_State_code1, Physician_License_State_code2, 
> Physician_License_State_code3, Physician_License_State_code4, 
> Physician_License_State_code5, 
> Submitting_Applicable_Manufacturer_or_Applicable_GPO_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_ID, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Name, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_State, 
> Applicable_Manufacturer_or_Applicable_GPO_Making_Payment_Country, 
> Total_Amount_of_Payment_USDollars, Date_of_Payment, 
> Number_of_Payments_Included_in_Total_Amount, 
> Form_of_Payment_or_Transfer_of_Value, Nature_of_Payment_or_Transfer_of_Value, 
> City_of_Travel, State_of_Travel, Country_of_Travel, 
> Physician_Ownership_Indicator, Third_Party_Payment_Recipient_Indicator, 
> Name_of_Third_Party_Entity_Receiving_Payment_or_Transfer_of_Value, 
> Charity_Indi

[jira] [Updated] (SPARK-30291) Catch the exception when do materialize in AQE

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30291:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Catch the exception when do materialize in AQE
> --
>
> Key: SPARK-30291
> URL: https://issues.apache.org/jira/browse/SPARK-30291
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> We need to catch the exception thrown when materializing a QueryStage in AQE, 
> so that the user can get more information about the exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30315) Add adaptive execution context

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30315.
-
Fix Version/s: 3.0.0
 Assignee: Wei Xue
   Resolution: Fixed

> Add adaptive execution context
> --
>
> Key: SPARK-30315
> URL: https://issues.apache.org/jira/browse/SPARK-30315
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>
> Create an adaptive execution context class to wrap objects shared across main 
> query and sub-queries. This refactoring will reduce the number of parameters 
> used to initialize {{AdaptiveSparkPlanExec}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30315) Add adaptive execution context

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30315:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Task)

> Add adaptive execution context
> --
>
> Key: SPARK-30315
> URL: https://issues.apache.org/jira/browse/SPARK-30315
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>
> Create an adaptive execution context class to wrap objects shared across main 
> query and sub-queries. This refactoring will reduce the number of parameters 
> used to initialize {{AdaptiveSparkPlanExec}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30307) remove ReusedQueryStageExec

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30307:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> remove ReusedQueryStageExec
> ---
>
> Key: SPARK-30307
> URL: https://issues.apache.org/jira/browse/SPARK-30307
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30188) Fix tests when enable Adaptive Query Execution

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30188:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Test)

> Fix tests when enable Adaptive Query Execution
> --
>
> Key: SPARK-30188
> URL: https://issues.apache.org/jira/browse/SPARK-30188
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Fix the failed unit tests when enable Adaptive Query Execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30403:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Fix the NoSuchElementException exception when enable AQE with InSubquery use 
> case
> -
>
> Key: SPARK-30403
> URL: https://issues.apache.org/jira/browse/SPARK-30403
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> After merging [https://github.com/apache/spark/pull/25854], we also need to 
> handle the InSubquery case when building the SubqueryMap in the 
> InsertAdaptiveSparkPlan rule. Otherwise we will get a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30407:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe
> 
>
> Key: SPARK-30407
> URL: https://issues.apache.org/jira/browse/SPARK-30407
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> While working on [https://github.com/apache/spark/pull/26813], we found that 
> with AQE on, the metrics info of AdaptiveSparkPlanExec is not reset when 
> running the test DataFrameCallbackSuite#"get numRows metrics by callback".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI When enable AQE

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30549:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Fix the subquery metrics showing issue in UI When enable AQE
> 
>
> Key: SPARK-30549
> URL: https://issues.apache.org/jira/browse/SPARK-30549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> After merging [https://github.com/apache/spark/pull/25316], the subquery 
> metrics cannot be shown in the UI when AQE is enabled. This PR will fix that 
> subquery metrics display issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30571) coalesce shuffle reader with splitting shuffle fetch request fails

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30571:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> coalesce shuffle reader with splitting shuffle fetch request fails
> --
>
> Key: SPARK-30571
> URL: https://issues.apache.org/jira/browse/SPARK-30571
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31424) Rename AdaptiveSparkPlanHelper.collectInPlanAndSubqueries to collectWithSubqueries

2020-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31424:
-

Assignee: Xiao Li

> Rename AdaptiveSparkPlanHelper.collectInPlanAndSubqueries to 
> collectWithSubqueries
> --
>
> Key: SPARK-31424
> URL: https://issues.apache.org/jira/browse/SPARK-31424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Xiao Li
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31424) Rename AdaptiveSparkPlanHelper.collectInPlanAndSubqueries to collectWithSubqueries

2020-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31424.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28193
[https://github.com/apache/spark/pull/28193]

> Rename AdaptiveSparkPlanHelper.collectInPlanAndSubqueries to 
> collectWithSubqueries
> --
>
> Key: SPARK-31424
> URL: https://issues.apache.org/jira/browse/SPARK-31424
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Xiao Li
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31416) Check more strictly that a field name can be used as a valid Java identifier for codegen

2020-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31416:
--
Parent: SPARK-29194
Issue Type: Sub-task  (was: Improvement)

> Check more strictly that a field name can be used as a valid Java identifier 
> for codegen
> 
>
> Key: SPARK-31416
> URL: https://issues.apache.org/jira/browse/SPARK-31416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.0.0
>
>
> ScalaReflection checks that a field name can be used as a valid Java 
> identifier by checking that the field name is not a reserved keyword.
> But in the current implementation, the keyword enum is missed.
> Further, some characters, including numeric literals, cannot be used in valid 
> identifiers but are not checked either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31416) Check more strictly that a field name can be used as a valid Java identifier for codegen

2020-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31416.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28184
[https://github.com/apache/spark/pull/28184]

> Check more strictly that a field name can be used as a valid Java identifier 
> for codegen
> 
>
> Key: SPARK-31416
> URL: https://issues.apache.org/jira/browse/SPARK-31416
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.0.0
>
>
> ScalaReflection checks that a field name can be used as a valid Java 
> identifier by checking that the field name is not a reserved keyword.
> But in the current implementation, the keyword enum is missed.
> Further, some characters, including numeric literals, cannot be used in valid 
> identifiers but are not checked either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31416) Check more strictly that a field name can be used as a valid Java identifier for codegen

2020-04-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31416:
--
Affects Version/s: (was: 3.1.0)
   3.0.0

> Check more strictly that a field name can be used as a valid Java identifier 
> for codegen
> 
>
> Key: SPARK-31416
> URL: https://issues.apache.org/jira/browse/SPARK-31416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.0.0
>
>
> ScalaReflection checks that a field name can be used as a valid Java 
> identifier by checking that the field name is not a reserved keyword.
> But in the current implementation, the keyword enum is missed.
> Further, some characters, including numeric literals, cannot be used in valid 
> identifiers but are not checked either.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30719) AQE should not issue a "not supported" warning for queries being by-passed

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30719:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> AQE should not issue a "not supported" warning for queries being by-passed
> --
>
> Key: SPARK-30719
> URL: https://issues.apache.org/jira/browse/SPARK-30719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.0.0
>
>
> This is a follow up for [https://github.com/apache/spark/pull/26813].
> AQE bypasses queries that don't have exchanges or subqueries. This is not a 
> limitation and it is different from queries that are not supported in AQE. 
> Issuing a warning in this case can be confusing and annoying.
> It would also be good to add an internal conf for this bypassing behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30801) Subqueries should not be AQE-ed if main query is not

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30801:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Subqueries should not be AQE-ed if main query is not
> 
>
> Key: SPARK-30801
> URL: https://issues.apache.org/jira/browse/SPARK-30801
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently there are unsupported queries by AQE, e.g., queries that contain 
> DPP filters. But if the main query is unsupported while the sub-query is, the 
> subquery itself will be AQE-ed, which can lead to performance regressions due 
> to missed opportunity of sub-query reuse.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30751:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Combine the skewed readers into one in AQE skew join optimizations
> --
>
> Key: SPARK-30751
> URL: https://issues.apache.org/jira/browse/SPARK-30751
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Assume we have N partitions based on the original join keys, and for a 
> specific partition id {{Pi}} (i = 1 to N), we slice the left partition into 
> {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right 
> partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). 
> With the current approach, we'll end up with a sum of {{Li}} * {{Mi}} (i = 1 
> to N where Li > 1 or Mi > 1) joins, plus one more join for the rest of the 
> partitions without skew. *This can be a serious performance concern as the 
> size of the query plan now depends on the number and size of skewed 
> partitions.*
> Now instead of generating so many joins we can create a “repeated” reader for 
> either side of the join so that:
>  # for the left side, with each partition id Pi and any given slice {{Sj}} in 
> {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective 
> join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}}
>  # for the right side, with each partition id Pi and any given slice {{Tk}} 
> in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with 
> respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}}
> That way, we can have one SMJ for all the partitions and only one type of 
> special reader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30906) Turning off AQE in CacheManager is not thread-safe

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30906:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Turning off AQE in CacheManager is not thread-safe
> --
>
> Key: SPARK-30906
> URL: https://issues.apache.org/jira/browse/SPARK-30906
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Blocker
> Fix For: 3.0.0
>
>
> This is a fix for https://issues.apache.org/jira/browse/SPARK-30188
> AQE should have been turned off for "recache" too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30991) Refactor AQE readers and RDDs

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30991:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Refactor AQE readers and RDDs
> -
>
> Key: SPARK-30991
> URL: https://issues.apache.org/jira/browse/SPARK-30991
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30922) Remove the max split config after changing the multi sub joins to multi sub partitions

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30922:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Remove the max split config after changing the multi sub joins to multi sub 
> partitions
> --
>
> Key: SPARK-30922
> URL: https://issues.apache.org/jira/browse/SPARK-30922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> After merging PR#27493, we no longer need the 
> "spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits" config 
> to resolve the UI issue when splitting into more sub-joins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30999) Don't cancel a QueryStageExec when it's already finished

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30999:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Don't cancel a QueryStageExec when it's already finished
> 
>
> Key: SPARK-30999
> URL: https://issues.apache.org/jira/browse/SPARK-30999
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> We should not cancel a QueryStageExec that has already finished (succeeded or 
> failed). In the worst case, cancelling will fail the already-failed stage again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31045) Add config for AQE logging level

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31045:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Add config for AQE logging level
> 
>
> Key: SPARK-31045
> URL: https://issues.apache.org/jira/browse/SPARK-31045
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31046) Make more efficient and clean up AQE update UI code

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31046:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> Make more efficient and clean up AQE update UI code
> ---
>
> Key: SPARK-31046
> URL: https://issues.apache.org/jira/browse/SPARK-31046
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31428) Document Common Table Expression in SQL Reference

2020-04-12 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31428:
--

 Summary: Document Common Table Expression in SQL Reference
 Key: SPARK-31428
 URL: https://issues.apache.org/jira/browse/SPARK-31428
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document Common Table Expression in SQL Reference
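
For reference, a small example of the feature to be documented (a hedged 
sketch; the table alias and values are illustrative, assuming a running 
SparkSession `spark`):
{code:java}
// Hedged sketch: a simple common table expression (CTE) of the kind the new
// SQL Reference page would cover.
spark.sql("""
  WITH dept_counts AS (
    SELECT dept, COUNT(*) AS n
    FROM VALUES ('eng'), ('eng'), ('sales') AS t(dept)
    GROUP BY dept
  )
  SELECT dept, n FROM dept_counts WHERE n > 1
""").show()
{code}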



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31096) Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec`

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31096:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> Replace `Array` with `Seq` in AQE `CustomShuffleReaderExec`
> ---
>
> Key: SPARK-31096
> URL: https://issues.apache.org/jira/browse/SPARK-31096
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31206) AQE will use the same SubqueryExec even if subqueryReuseEnabled=false

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31206:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> AQE will use the same SubqueryExec even if subqueryReuseEnabled=false
> -
>
> Key: SPARK-31206
> URL: https://issues.apache.org/jira/browse/SPARK-31206
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> In `InsertAdaptiveSparkPlan.buildSubqueryMap`, AQE skips compiling subqueries 
> that share the same exprId. As a result, in PlanAdaptiveSubqueries, the same 
> SubqueryExec is used for SubqueryExpressions with the same exprId, even when 
> subqueryReuseEnabled=false.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31384) NPE in OptimizeSkewedJoin when there's a inputRDD of plan has 0 partition

2020-04-12 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31384:

Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> NPE in OptimizeSkewedJoin when there's a inputRDD of plan has 0 partition
> -
>
> Key: SPARK-31384
> URL: https://issues.apache.org/jira/browse/SPARK-31384
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> When a plan's input RDD has 0 partitions, the OptimizeSkewedJoin rule can hit 
> an NPE.
> The issue can be reproduced with the test below:
> {code:java}
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
>   SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
>   withTempView("t2") {
> // create DataFrame with 0 partition
> spark.createDataFrame(sparkContext.emptyRDD[Row], new 
> StructType().add("b", IntegerType))
>   .createOrReplaceTempView("t2")
> // should run successfully without NPE
> runAdaptiveAndVerifyResult("SELECT * FROM testData2 t1 left semi join t2 
> ON t1.a=t2.b")
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31330:


Assignee: Nicholas Chammas

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31330) Automatically label PRs based on the paths they touch

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31330.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28114
[https://github.com/apache/spark/pull/28114]

> Automatically label PRs based on the paths they touch
> -
>
> Key: SPARK-31330
> URL: https://issues.apache.org/jira/browse/SPARK-31330
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.1.0
>
>
> We can potentially leverage the added labels to drive testing, review, or 
> other project tooling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31414) Performance regression with new TimestampFormatter for json and csv

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31414:
---

Assignee: Kent Yao

> Performance regression with new TimestampFormatter for json and csv
> ---
>
> Key: SPARK-31414
> URL: https://issues.apache.org/jira/browse/SPARK-31414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> With the original benchmark, where the timestamp values are valid for the new 
> parser, the result is:
> {code:java}
> [info] Running benchmark: Read dates and timestamps
> [info]   Running case: timestamp strings
> [info]   Stopped after 3 iterations, 5781 ms
> [info]   Running case: parse timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 44764 ms
> [info]   Running case: infer timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 93764 ms
> [info]   Running case: from_json(timestamp)
> [info]   Stopped after 3 iterations, 59021 ms
> {code}
> When we modify the benchmark to:
> {code:java}
>   def timestampStr: Dataset[String] = {
> spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
>   iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
> }.select($"value".as("timestamp")).as[String]
>   }
>   readBench.addCase("timestamp strings", numIters) { _ =>
> timestampStr.noop()
>   }
>   readBench.addCase("parse timestamps from Dataset[String]", numIters) { 
> _ =>
> spark.read.schema(tsSchema).json(timestampStr).noop()
>   }
>   readBench.addCase("infer timestamps from Dataset[String]", numIters) { 
> _ =>
> spark.read.json(timestampStr).noop()
>   }
> {code}
> where the timestamp values are invalid for the new parser, which causes a 
> fallback to the legacy parser, the result is:
> {code:java}
> [info] Running benchmark: Read dates and timestamps
> [info]   Running case: timestamp strings
> [info]   Stopped after 3 iterations, 5623 ms
> [info]   Running case: parse timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 506637 ms
> [info]   Running case: infer timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 509076 ms
> {code}
> This is about a 10x performance regression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29799) Split a kafka partition into multiple KafkaRDD partitions in the kafka external plugin for Spark Streaming

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29799.
--
Resolution: Duplicate

> Split a kafka partition into multiple KafkaRDD partitions in the kafka 
> external plugin for Spark Streaming
> --
>
> Key: SPARK-29799
> URL: https://issues.apache.org/jira/browse/SPARK-29799
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: zengrui
>Priority: Major
> Attachments: 0001-add-implementation-for-issue-SPARK-29799.patch
>
>
> When we use Spark Streaming to consume records from Kafka, the generated 
> KafkaRDD's partition number is equal to the Kafka topic's partition number, so 
> we cannot use more CPU cores to execute the streaming task unless we change 
> the topic's partition number, and we cannot increase the topic's partition 
> number indefinitely.
> I think we can split a Kafka partition into multiple KafkaRDD partitions, make 
> it configurable, and then use more CPU cores to execute the streaming task.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31414) Performance regression with new TimestampFormatter for json and csv

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31414.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28181
[https://github.com/apache/spark/pull/28181]

> Performance regression with new TimestampFormatter for json and csv
> ---
>
> Key: SPARK-31414
> URL: https://issues.apache.org/jira/browse/SPARK-31414
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> With the original benchmark, where the timestamp values are valid for the new 
> parser, the result is:
> {code:java}
> [info] Running benchmark: Read dates and timestamps
> [info]   Running case: timestamp strings
> [info]   Stopped after 3 iterations, 5781 ms
> [info]   Running case: parse timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 44764 ms
> [info]   Running case: infer timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 93764 ms
> [info]   Running case: from_json(timestamp)
> [info]   Stopped after 3 iterations, 59021 ms
> {code}
> When we modify the benchmark to:
> {code:java}
>   def timestampStr: Dataset[String] = {
> spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
>   iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
> }.select($"value".as("timestamp")).as[String]
>   }
>   readBench.addCase("timestamp strings", numIters) { _ =>
> timestampStr.noop()
>   }
>   readBench.addCase("parse timestamps from Dataset[String]", numIters) { 
> _ =>
> spark.read.schema(tsSchema).json(timestampStr).noop()
>   }
>   readBench.addCase("infer timestamps from Dataset[String]", numIters) { 
> _ =>
> spark.read.json(timestampStr).noop()
>   }
> {code}
> where the timestamp values are invalid for the new parser, which causes a 
> fallback to the legacy parser,
> the result is 
> {code:java}
> [info] Running benchmark: Read dates and timestamps
> [info]   Running case: timestamp strings
> [info]   Stopped after 3 iterations, 5623 ms
> [info]   Running case: parse timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 506637 ms
> [info]   Running case: infer timestamps from Dataset[String]
> [info]   Stopped after 3 iterations, 509076 ms
> {code}
> About a 10x performance regression.
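For anyone re-running the numbers, a small sketch (not part of this ticket) of forcing the legacy parser directly; it assumes the `spark.sql.legacy.timeParserPolicy` configuration and the `noop` write format available in Spark 3.0, and reuses the `timestampStr` Dataset[String] defined in the snippet above.

{code:scala}
// Assumption: with the LEGACY policy the JSON datasource falls back to the old
// SimpleDateFormat-based parser, so the two code paths can be timed separately.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

spark.read
  .schema("`timestamp` TIMESTAMP")
  .json(timestampStr)                        // Dataset[String] from the benchmark above
  .write.format("noop").mode("overwrite").save()
{code}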



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29854) lpad and rpad built in function not throw Exception for invalid len value

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29854.
--
Resolution: Cannot Reproduce

> lpad and rpad built in function not throw Exception for invalid len value
> -
>
> Key: SPARK-29854
> URL: https://issues.apache.org/jira/browse/SPARK-29854
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> Spark returns an empty string:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> lpad('hihhh', 5000, '');
>  ++
> |lpad(hihhh, CAST(5000 AS INT), 
> )|
> ++
> ++
> Hive:
> SELECT lpad('hihhh', 5000, 
> '');
>  Error: Error while compiling statement: FAILED: SemanticException [Error 
> 10016]: Line 1:67 Argument type mismatch '''': lpad only takes 
> INT/SHORT/BYTE types as 2-ths argument, got DECIMAL (state=42000,code=10016)
> PostgreSQL
> function lpad(unknown, numeric, unknown) does not exist
>  
> Expected output:
> Spark should also throw an exception, like Hive does.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31427) Spark Structure streaming read data twice per every micro-batch.

2020-04-12 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082036#comment-17082036
 ] 

Jungtaek Lim commented on SPARK-31427:
--

Could you please check whether using Spark 3.0 preview 2 mitigates the issue? 
There are some improvements to Kafka consumer caching in Spark 3.0.0, so you 
may want to see whether they help. If they don't, it may need more 
investigation.

> Spark Structure streaming read data twice per every micro-batch.
> 
>
> Key: SPARK-31427
> URL: https://issues.apache.org/jira/browse/SPARK-31427
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Nick Hryhoriev
>Priority: Major
>
> I have a very strange issue with Spark Structured Streaming. Structured 
> Streaming creates two Spark jobs for every micro-batch and, as a result, 
> reads the data from Kafka twice. Here is a simple code snippet.
>  
> {code:java}
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.streaming.Trigger
> object CheckHowSparkReadFromKafka {
>   def main(args: Array[String]): Unit = {
> val session = SparkSession.builder()
>   .config(new SparkConf()
> .setAppName(s"simple read from kafka with repartition")
> .setMaster("local[*]")
> .set("spark.driver.host", "localhost"))
>   .getOrCreate()
> val testPath = "/tmp/spark-test"
> FileSystem.get(session.sparkContext.hadoopConfiguration).delete(new 
> Path(testPath), true)
> import session.implicits._
> val stream = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers","kafka-20002-prod:9092")
>   .option("subscribe", "topic")
>   .option("maxOffsetsPerTrigger", 1000)
>   .option("failOnDataLoss", false)
>   .option("startingOffsets", "latest")
>   .load()
>   .repartitionByRange( $"offset")
>   .writeStream
>   .option("path", testPath + "/data")
>   .option("checkpointLocation", testPath + "/checkpoint")
>   .format("parquet")
>   .trigger(Trigger.ProcessingTime(10.seconds))
>   .start()
> stream.processAllAvailable()
> {code}
> This happens because of {{.repartitionByRange( $"offset")}}: if I remove this 
> line, all is good. With it, Spark creates two jobs, one with a single stage 
> that just reads from Kafka, and a second with 3 stages (read -> shuffle -> 
> write). So the result of the first job is never used.
> This has a significant impact on performance. Some of my Kafka topics have 
> 1550 partitions, so reading them twice is a big deal. If I add a cache, 
> things get better, but that is not an option for me. In local mode, the first 
> job of each batch takes less than 0.1 ms, except the batch with index 0. But 
> on a YARN cluster and on Mesos, both jobs fully execute and on my topics take 
> nearly 1.2 min.
>  
>  
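A sketch of the cache-based mitigation mentioned above, with the same placeholder paths; `source` stands for the DataFrame returned by `session.readStream ... load()` before the `repartitionByRange` call. The extra job most likely comes from `repartitionByRange` sampling its input to compute range boundaries, so persisting each micro-batch keeps the Kafka read from running twice (foreachBatch is available from Spark 2.4.0).

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

// Persist each micro-batch once, then let the range shuffle reuse the cached data.
val writeBatch: (DataFrame, Long) => Unit = (batch, _) => {
  batch.persist()
  batch.repartitionByRange(batch("offset"))
    .write.mode("append").parquet(testPath + "/data")
  batch.unpersist()
}

val query = source.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", testPath + "/checkpoint")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
{code}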



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082039#comment-17082039
 ] 

Hyukjin Kwon commented on SPARK-31423:
--

cc [~maxgekk], [~cloud_fan] FYI.

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +-------------------+
> |                 ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +-------------------+
> |ts                 |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082040#comment-17082040
 ] 

Hyukjin Kwon commented on SPARK-31413:
--

Questions should go to the mailing list. You would have a better answer there.

> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using Spark 2.4.3 with Java. We would like to log the partition key and 
> the sequence number of every event. The overloaded createStream function of 
> KinesisUtils always fails to compile.
>  
> Function printSeq = s -> s;
>  KinesisUtils.createStream(
>  jssc,
>  appName,
>  streamName,
>  endPointUrl,
>  regionName,
>  InitialPositionInStream.TRIM_HORIZON,
>  kinesisCheckpointInterval,
>  StorageLevel.MEMORY_AND_DISK_SER(),
>  printSeq,
>  Record.class);
> *{color:#172b4d}Record is of type 
> com.amazonaws.services.kinesis.model.Record{color}*
>  
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS : 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?
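Not an answer from the thread, only a sketch of the Scala message-handler overload of KinesisUtils.createStream in spark-streaming-kinesis-asl 2.4 (the exact overload is an assumption here); it exposes the raw Record, so the partition key and sequence number can be logged. Variables such as `ssc`, `appName`, `streamName`, `endPointUrl` and `regionName` mirror the report and are assumed to be defined.

{code:scala}
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import com.amazonaws.services.kinesis.model.Record
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils

// The handler runs in the receiver and can read any field of the Kinesis Record.
def handler(r: Record): (String, String) = (r.getPartitionKey, r.getSequenceNumber)

val stream = KinesisUtils.createStream(
  ssc, appName, streamName, endPointUrl, regionName,
  InitialPositionInStream.TRIM_HORIZON,
  Seconds(10),                              // checkpoint interval (placeholder)
  StorageLevel.MEMORY_AND_DISK_SER,
  handler _)

stream.foreachRDD(_.take(5).foreach { case (key, seq) =>
  println(s"partitionKey=$key sequenceNumber=$seq")
})
{code}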



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31413.
--
Resolution: Invalid

> Accessing the sequence number and partition id for records in Kinesis adapter
> -
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: varun senthilnathan
>Priority: Critical
>
> We are using Spark 2.4.3 with Java. We would like to log the partition key and 
> the sequence number of every event. The overloaded createStream function of 
> KinesisUtils always fails to compile.
>  
> Function printSeq = s -> s;
>  KinesisUtils.createStream(
>  jssc,
>  appName,
>  streamName,
>  endPointUrl,
>  regionName,
>  InitialPositionInStream.TRIM_HORIZON,
>  kinesisCheckpointInterval,
>  StorageLevel.MEMORY_AND_DISK_SER(),
>  printSeq,
>  Record.class);
> *{color:#172b4d}Record is of type 
> com.amazonaws.services.kinesis.model.Record{color}*
>  
> The exception is as follows:
> {quote}no suitable method found for 
> createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class)
> {quote}
> JAVA DOCS : 
> [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]
> Is there a way out?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31319) Document UDF in SQL Reference

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31319.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28087
[https://github.com/apache/spark/pull/28087]

> Document UDF in SQL Reference
> -
>
> Key: SPARK-31319
> URL: https://issues.apache.org/jira/browse/SPARK-31319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Document UDF in SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31319) Document UDF in SQL Reference

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-31319:


Assignee: Huaxin Gao

> Document UDF in SQL Reference
> -
>
> Key: SPARK-31319
> URL: https://issues.apache.org/jira/browse/SPARK-31319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Document UDF in SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31419) Document Table-valued Function and Inline Table

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-31419:


Assignee: Huaxin Gao

> Document Table-valued Function and Inline Table
> ---
>
> Key: SPARK-31419
> URL: https://issues.apache.org/jira/browse/SPARK-31419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Document Table-valued Function and Inline Table



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31419) Document Table-valued Function and Inline Table

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31419.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28185
[https://github.com/apache/spark/pull/28185]

> Document Table-valued Function and Inline Table
> ---
>
> Key: SPARK-31419
> URL: https://issues.apache.org/jira/browse/SPARK-31419
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Document Table-valued Function and Inline Table



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31429:
--

 Summary: Add additional fields in ExpressionDescription for more 
granular category in documentation
 Key: SPARK-31429
 URL: https://issues.apache.org/jira/browse/SPARK-31429
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add additional fields in ExpressionDescription so we can have more granular 
categories in the function documentation. For example, we want to group window 
functions into finer categories such as ranking functions and analytic functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31383) Clean up the SQL documents in docs/sql-ref*

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-31383:


Assignee: Takeshi Yamamuro

> Clean up the SQL documents in docs/sql-ref*
> ---
>
> Key: SPARK-31383
> URL: https://issues.apache.org/jira/browse/SPARK-31383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> This ticket intends to clean up the SQL documents in `doc/sql-ref*`.
> Main changes are as follows;
> - Fixes wrong syntaxes and capitalize sub-titles
>  - Adds some DDL queries in `Examples` so that users can run examples there
>  - Makes query output in `Examples` follows the `Dataset.showString` 
> (right-aligned) format 
>  - Adds/Removes spaces, Indents, or blank lines to follow the format below;
> {code}
> ---
> license...
> ---
> ### Description
> Writes what's the syntax is.
> ### Syntax
> {% highlight sql %}
> SELECT...
>  WHERE... // 4 indents after the second line
>  ...
> {% endhighlight %}
> ### Parameters
> 
>  
>  Param Name
>  
>  Param Description
>  
>  ...
> 
> ### Examples
> {% highlight sql %}
> -- It is better that users are able to execute example queries here.
> -- So, we prepare test data in the first section if possible.
> CREATE TABLE t (key STRING, value DOUBLE);
> INSERT INTO t VALUES
>  ('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0);
> -- query output has 2 indents and it follows the `Dataset.showString`
> -- format (right-aligned).
> SELECT * FROM t;
>  +---+-----+
>  |key|value|
>  +---+-----+
>  |  a|  1.0|
>  |  a|  2.0|
>  |  b|  3.0|
>  |  c|  4.0|
>  +---+-----+
> -- Query statements after the second line have 4 indents.
> SELECT key, SUM(value)
>  FROM t
>  GROUP BY key;
>  +---+----------+
>  |key|sum(value)|
>  +---+----------+
>  |  c|       4.0|
>  |  b|       3.0|
>  |  a|       3.0|
>  +---+----------+
> ...
> {% endhighlight %}
> ### Related Statements
> * [XXX](xxx.html)
>  * ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082050#comment-17082050
 ] 

Huaxin Gao commented on SPARK-31429:


cc [~hyukjin.kwon] [~maropu]

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> categories in the function documentation. For example, we want to group window 
> functions into finer categories such as ranking functions and analytic 
> functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31383) Clean up the SQL documents in docs/sql-ref*

2020-04-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31383.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28151
[https://github.com/apache/spark/pull/28151]

> Clean up the SQL documents in docs/sql-ref*
> ---
>
> Key: SPARK-31383
> URL: https://issues.apache.org/jira/browse/SPARK-31383
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> This ticket intends to clean up the SQL documents in `doc/sql-ref*`.
> Main changes are as follows;
> - Fixes wrong syntaxes and capitalize sub-titles
>  - Adds some DDL queries in `Examples` so that users can run examples there
>  - Makes query output in `Examples` follows the `Dataset.showString` 
> (right-aligned) format 
>  - Adds/Removes spaces, Indents, or blank lines to follow the format below;
> {code}
> ---
> license...
> ---
> ### Description
> Writes what's the syntax is.
> ### Syntax
> {% highlight sql %}
> SELECT...
>  WHERE... // 4 indents after the second line
>  ...
> {% endhighlight %}
> ### Parameters
> 
>  
>  Param Name
>  
>  Param Description
>  
>  ...
> 
> ### Examples
> {% highlight sql %}
> -- It is better that users are able to execute example queries here.
> -- So, we prepare test data in the first section if possible.
> CREATE TABLE t (key STRING, value DOUBLE);
> INSERT INTO t VALUES
>  ('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0);
> -- query output has 2 indents and it follows the `Dataset.showString`
> -- format (right-aligned).
> SELECT * FROM t;
>  +---+-----+
>  |key|value|
>  +---+-----+
>  |  a|  1.0|
>  |  a|  2.0|
>  |  b|  3.0|
>  |  c|  4.0|
>  +---+-----+
> -- Query statements after the second line have 4 indents.
> SELECT key, SUM(value)
>  FROM t
>  GROUP BY key;
>  +---+----------+
>  |key|sum(value)|
>  +---+----------+
>  |  c|       4.0|
>  |  b|       3.0|
>  |  a|       3.0|
>  +---+----------+
> ...
> {% endhighlight %}
> ### Related Statements
> * [XXX](xxx.html)
>  * ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-12 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082051#comment-17082051
 ] 

Maxim Gekk commented on SPARK-31423:


This is intentional behavior because ORC format assumes the hybrid calendar 
(Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian calendar. 
See https://issues.apache.org/jira/browse/SPARK-30951
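For readers unfamiliar with the difference, a small standalone illustration using plain Java time APIs (not Spark code): java.util.GregorianCalendar uses the hybrid Julian plus Gregorian calendar with the 1582 cut-over, while java.time.LocalDate is proleptic Gregorian, which is exactly the 10-day gap reported above.

{code:scala}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar}

// Proleptic Gregorian (java.time): 1582-10-05 .. 1582-10-14 are ordinary dates.
println(LocalDate.of(1582, 10, 4).plusDays(1))    // 1582-10-05

// Hybrid calendar (java.util): the day after 1582-10-04 is 1582-10-15,
// so dates in the skipped range end up shifted by 10 days.
val cal = new GregorianCalendar(1582, Calendar.OCTOBER, 4)
cal.add(Calendar.DAY_OF_MONTH, 1)
println(cal.getTime)                              // ... Oct 15 ... 1582
{code}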

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +-------------------+
> |                 ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +-------------------+
> |ts                 |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31399) Closure cleaner broken in Scala 2.12

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31399:
-
Summary: Closure cleaner broken in Scala 2.12  (was: closure cleaner is 
broken in Scala 2.12)

> Closure cleaner broken in Scala 2.12
> 
>
> Key: SPARK-31399
> URL: https://issues.apache.org/jira/browse/SPARK-31399
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> The `ClosureCleaner` only support Scala functions and it uses the following 
> check to catch closures
> {code}
>   // Check whether a class represents a Scala closure
>   private def isClosure(cls: Class[_]): Boolean = {
> cls.getName.contains("$anonfun$")
>   }
> {code}
> This doesn't work in 3.0 any more, as we upgraded to Scala 2.12 and most 
> Scala functions become Java lambdas.
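A sketch of the kind of widened check that would also catch Scala 2.12 closures; this is purely illustrative and is not the change Spark eventually made. Lambdas produced by LambdaMetafactory get synthetic class names containing "$Lambda$", as can be seen in the serialization stack further down.

{code:scala}
// Illustrative only: treat indylambda classes from Scala 2.12 as closures too.
private def isClosure(cls: Class[_]): Boolean = {
  val name = cls.getName
  name.contains("$anonfun$") ||   // Scala 2.11-style anonymous function classes
    name.contains("$Lambda$")     // LambdaMetafactory-generated classes in Scala 2.12
}
{code}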
> As an example, the following code works well in Spark 2.4 Spark Shell:
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> import org.apache.spark.sql.functions.lit
> defined class Foo
> col: org.apache.spark.sql.Column = 123
> df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20
> {code}
> But fails in 3.0
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2371)
>   at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
>   at org.apache.spark.rdd.RDD.map(RDD.scala:421)
>   ... 39 elided
> Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column
> Serialization stack:
>   - object not serializable (class: org.apache.spark.sql.Column, value: 
> 123)
>   - field (class: $iw, name: col, type: class org.apache.spark.sql.Column)
>   - object (class $iw, $iw@2d87ac2b)
>   - element of array (index: 0)
>   - array (class [Ljava.lang.Object;, size 1)
>   - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, 
> type: class [Ljava.lang.Object;)
>   - object (class java.lang.invoke.SerializedLambda, 
> SerializedLambda[capturingClass=class $iw, 
> functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, 
> instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1])
>   - writeReplace data (class: java.lang.invoke.SerializedLambda)
>   - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43)
>   at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393)
>   ... 47 more
> {code}
> **Apache Spark 2.4.5 with Scala 2.12**
> {code}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.5
>   /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> import org.apache.spark.sql.functions.lit
> case class Foo(id: String)
> val col = lit("123")
> val df = sc.range(0,10,1,1).map { _ => Foo("") }
> // Exiting paste mode, now interpreting.
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
>   at org.apache.spark.SparkContext.clea

[jira] [Issue Comment Deleted] (SPARK-31423) DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC

2020-04-12 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31423:
---
Comment: was deleted

(was: This is intentional behavior because ORC format assumes the hybrid 
calendar (Julian + Gregorian) but Parquet and Avro assume Proleptic Gregorian 
calendar. See https://issues.apache.org/jira/browse/SPARK-30951)

> DATES and TIMESTAMPS for a certain range are off by 10 days when stored in ORC
> --
>
> Key: SPARK-31423
> URL: https://issues.apache.org/jira/browse/SPARK-31423
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Bruce Robbins
>Priority: Major
>
> There is a range of days (1582-10-05 to 1582-10-14) for which DATEs and 
> TIMESTAMPS are changed when stored in ORC. The value is off by 10 days.
> For example:
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.show // seems fine
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_date")
> scala> spark.read.orc("/tmp/funny_orc_date").show // off by 10 days
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala>
> {noformat}
> ORC has the same issue with TIMESTAMPS:
> {noformat}
> scala> val df = sql("select cast('1582-10-14 00:00:00' as TIMESTAMP) ts")
> df: org.apache.spark.sql.DataFrame = [ts: timestamp]
> scala> df.show // seems fine
> +-------------------+
> |                 ts|
> +-------------------+
> |1582-10-14 00:00:00|
> +-------------------+
> scala> df.write.mode("overwrite").orc("/tmp/funny_orc_timestamp")
> scala> spark.read.orc("/tmp/funny_orc_timestamp").show(truncate=false) // off 
> by 10 days
> +-------------------+
> |ts                 |
> +-------------------+
> |1582-10-24 00:00:00|
> +-------------------+
> scala> 
> {noformat}
> However, when written to Parquet or Avro, DATES and TIMESTAMPs for this range 
> do not change.
> {noformat}
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").parquet("/tmp/funny_parquet_date")
> scala> spark.read.parquet("/tmp/funny_parquet_date").show // reflects 
> original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> val df = sql("select cast('1582-10-14' as DATE) dt")
> df: org.apache.spark.sql.DataFrame = [dt: date]
> scala> df.write.mode("overwrite").format("avro").save("/tmp/funny_avro_date")
> scala> spark.read.format("avro").load("/tmp/funny_avro_date").show // 
> reflects original value
> +----------+
> |        dt|
> +----------+
> |1582-10-14|
> +----------+
> scala> 
> {noformat}
> It's unclear to me whether ORC is behaving correctly or not, as this is how 
> Spark 2.4 works with DATEs and TIMESTAMPs in general (and also how Spark 3.x 
> works with DATEs and TIMESTAMPs in general when 
> {{spark.sql.legacy.timeParserPolicy}} is set to {{LEGACY}}). In Spark 2.4, 
> DATEs and TIMESTAMPs in this range don't exist:
> {noformat}
> scala> sql("select cast('1582-10-14' as DATE) dt").show // the same cast done 
> in Spark 2.4
> +----------+
> |        dt|
> +----------+
> |1582-10-24|
> +----------+
> scala> 
> {noformat}
> I assume the following snippet is relevant (from the Wikipedia entry on the 
> Gregorian calendar):
> {quote}To deal with the 10 days' difference (between calendar and 
> reality)[Note 2] that this drift had already reached, the date was advanced 
> so that 4 October 1582 was followed by 15 October 1582
> {quote}
> Spark 3.x should treat DATEs and TIMESTAMPS in this range consistently, and 
> probably based on spark.sql.legacy.timeParserPolicy (or some other config) 
> rather than file format.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31386) Reading broadcast in UDF raises MemoryError when spark.executor.pyspark.memory is set

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31386.
--
Resolution: Incomplete

Can you reproduce this on a plain Spark cluster, not on EMR? It is possibly an 
issue specific to EMR. I will leave this as Incomplete until that information 
is provided.


> Reading broadcast in UDF raises MemoryError when 
> spark.executor.pyspark.memory is set
> -
>
> Key: SPARK-31386
> URL: https://issues.apache.org/jira/browse/SPARK-31386
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
> Environment: Spark 2.4.4 or AWS EMR
> `pyspark --conf spark.executor.pyspark.memory=500m`
>Reporter: Viacheslav Krot
>Priority: Major
> Attachments: 选区_267.png
>
>
> The following code with a UDF causes a MemoryError when 
> `spark.executor.pyspark.memory` is set:
> ```
> from pyspark.sql.types import BooleanType
>  from pyspark.sql.functions import udf
> df = spark.createDataFrame([
>    ('Alice', 10),
>    ('Bob', 12)
>  ], ['name', 'cnt'])
> broadcast = spark.sparkContext.broadcast([1,2,3])
> @udf(BooleanType())
>  def f(cnt):
>    return cnt < len(broadcast.value)
> df.filter(f(df.cnt)).count()
> ```
> The same code works well when spark.executor.pyspark.memory is not set. 
> The code by itself does not make any sense; it is just the simplest code that 
> reproduces the bug.
>  
> Error:
> ```
> 20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 6, 
> ip-172-31-32-201.us-east-2.compute.internal, executor 2): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):20/04/08 13:16:50 WARN TaskSetManager: Lost task 3.0 in stage 2.0 (TID 
> 6, ip-172-31-32-201.us-east-2.compute.internal, executor 2): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py",
>  line 377, in main    process()  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py",
>  line 372, in process    serializer.dump_stream(func(split_index, iterator), 
> outfile)  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py",
>  line 345, in dump_stream    
> self.serializer.dump_stream(self._batched(iterator), stream)  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py",
>  line 141, in dump_stream    for obj in iterator:  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/serializers.py",
>  line 334, in _batched    for item in iterator:  File "", line 1, in 
>   File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/worker.py",
>  line 85, in     return lambda *a: f(*a)  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/util.py",
>  line 113, in wrapper    return f(*args, **kwargs)  File "", line 3, 
> in f  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py",
>  line 148, in value    self._value = self.load_from_path(self._path)  File 
> "/mnt/yarn/usercache/hadoop/appcache/application_1586348201584_0006/container_1586348201584_0006_01_03/pyspark.zip/pyspark/broadcast.py",
>  line 124, in load_from_path    with open(path, 'rb', 1 << 20) as 
> f:MemoryError
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
>  at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:81)
>  at 
> org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:64)
>  at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>  at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at 
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.proces

[jira] [Resolved] (SPARK-31376) Non-global sort support for structured streaming

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31376.
--
Resolution: Invalid

> Non-global sort support for structured streaming
> 
>
> Key: SPARK-31376
> URL: https://issues.apache.org/jira/browse/SPARK-31376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Priority: Minor
>
> Currently, all sorting is disallowed with structured streaming queries. Not 
> allowing global sorting makes sense, but could non-global sorting (i.e. 
> sortWithinPartitions) be allowed? I'm running into this with an external 
> source I'm using, but not sure if this would be useful to file sources as 
> well. I have to foreachBatch so that I can do a sortWithinPartitions.
> Two main questions:
>  * Does a local sort cause issues with any exactly-once guarantees streaming 
> queries provide? I can't say I know or understand how these semantics work. 
> Or are there other issues I have not thought of that this would cause?
>  * Is the change as simple as changing the unsupported operations check to 
> only look for global sorts instead of all sorts?
> I have built a version that simply changes the unsupported check to only 
> disallow global sorts and it seems to be working. Anything I'm missing or is 
> it this simple?
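A sketch of the foreachBatch workaround the reporter describes (paths and column names are placeholders, and `streamingDF` stands for some streaming DataFrame): inside foreachBatch each micro-batch is a plain batch Dataset, so sortWithinPartitions is accepted there even though it is rejected on the streaming Dataset itself.

{code:scala}
import org.apache.spark.sql.DataFrame

val writeSorted: (DataFrame, Long) => Unit = (batch, _) => {
  batch.sortWithinPartitions("key")                   // local, non-global sort
    .write.mode("append").parquet("/tmp/sorted-out")  // placeholder sink
}

val query = streamingDF.writeStream
  .foreachBatch(writeSorted)
  .option("checkpointLocation", "/tmp/sorted-out-checkpoint")
  .start()
{code}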



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31376) Non-global sort support for structured streaming

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31376.
--
Resolution: Incomplete

> Non-global sort support for structured streaming
> 
>
> Key: SPARK-31376
> URL: https://issues.apache.org/jira/browse/SPARK-31376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Priority: Minor
>
> Currently, all sorting is disallowed with structured streaming queries. Not 
> allowing global sorting makes sense, but could non-global sorting (i.e. 
> sortWithinPartitions) be allowed? I'm running into this with an external 
> source I'm using, but not sure if this would be useful to file sources as 
> well. I have to foreachBatch so that I can do a sortWithinPartitions.
> Two main questions:
>  * Does a local sort cause issues with any exactly-once guarantees streaming 
> queries provide? I can't say I know or understand how these semantics work. 
> Or are there other issues I have not thought of that this would cause?
>  * Is the change as simple as changing the unsupported operations check to 
> only look for global sorts instead of all sorts?
> I have built a version that simply changes the unsupported check to only 
> disallow global sorts and it seems to be working. Anything I'm missing or is 
> it this simple?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31376) Non-global sort support for structured streaming

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-31376:
--

> Non-global sort support for structured streaming
> 
>
> Key: SPARK-31376
> URL: https://issues.apache.org/jira/browse/SPARK-31376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Priority: Minor
>
> Currently, all sorting is disallowed with structured streaming queries. Not 
> allowing global sorting makes sense, but could non-global sorting (i.e. 
> sortWithinPartitions) be allowed? I'm running into this with an external 
> source I'm using, but not sure if this would be useful to file sources as 
> well. I have to foreachBatch so that I can do a sortWithinPartitions.
> Two main questions:
>  * Does a local sort cause issues with any exactly-once guarantees streaming 
> queries provide? I can't say I know or understand how these semantics work. 
> Or are there other issues I have not thought of that this would cause?
>  * Is the change as simple as changing the unsupported operations check to 
> only look for global sorts instead of all sorts?
> I have built a version that simply changes the unsupported check to only 
> disallow global sorts and it seems to be working. Anything I'm missing or is 
> it this simple?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31376) Non-global sort support for structured streaming

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082055#comment-17082055
 ] 

Hyukjin Kwon commented on SPARK-31376:
--

I am resolving it for now - it seems definitely better to have a concrete idea 
before filing it as an issue. At this moment, it seems rather like a question to me.

> Non-global sort support for structured streaming
> 
>
> Key: SPARK-31376
> URL: https://issues.apache.org/jira/browse/SPARK-31376
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Adam Binford
>Priority: Minor
>
> Currently, all sorting is disallowed with structured streaming queries. Not 
> allowing global sorting makes sense, but could non-global sorting (i.e. 
> sortWithinPartitions) be allowed? I'm running into this with an external 
> source I'm using, but not sure if this would be useful to file sources as 
> well. I have to foreachBatch so that I can do a sortWithinPartitions.
> Two main questions:
>  * Does a local sort cause issues with any exactly-once guarantees streaming 
> queries provide? I can't say I know or understand how these semantics work. 
> Or are there other issues I have not thought of that this would cause?
>  * Is the change as simple as changing the unsupported operations check to 
> only look for global sorts instead of all sorts?
> I have built a version that simply changes the unsupported check to only 
> disallow global sorts and it seems to be working. Anything I'm missing or is 
> it this simple?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31374) Returning complex types in Pandas UDF

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31374:
-
Target Version/s:   (was: 3.0.0)

> Returning complex types in Pandas UDF
> -
>
> Key: SPARK-31374
> URL: https://issues.apache.org/jira/browse/SPARK-31374
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: F. H.
>Priority: Major
>  Labels: features
>
> I would like to return a complex type in an GROUPED_AGG operation:
> {code:python}
> window_overlap_schema = t.StructType([
>  t.StructField("counts", t.ArrayType(t.LongType())),
>  t.StructField("starts", t.ArrayType(t.LongType())),
>  t.StructField("ends", t.ArrayType(t.LongType())),
> ])
> @f.pandas_udf(window_overlap_schema, f.PandasUDFType.GROUPED_AGG)
> def spark_window_overlap([...]):
> [...]
> {code}
> However, I get the following error when trying to run this:
> {code:python}
> NotImplementedError: Invalid returnType with grouped aggregate Pandas UDFs: 
> StructType(List(StructField(counts,ArrayType(LongType,true),true),StructField(starts,ArrayType(LongType,true),true),StructField(ends,ArrayType(LongType,true),true)))
>  is not supported
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31375) Overwriting into dynamic partitions is appending data in pyspark

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082057#comment-17082057
 ] 

Hyukjin Kwon commented on SPARK-31375:
--

[~Chaitanya Chaganti] can you share a self-contained example that shows the 
steps to reproduce? It's difficult for me to parse the description.

> Overwriting into dynamic partitions is appending data in pyspark
> 
>
> Key: SPARK-31375
> URL: https://issues.apache.org/jira/browse/SPARK-31375
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
> Environment: databricks, s3, EMR, PySpark.
>Reporter: Sai Krishna Chaitanya Chaganti
>Priority: Major
>
> While overwriting data in specific partitions using insertInto, Spark is 
> appending data to those partitions even though the mode is overwrite. The 
> property below is set in the config to ensure that we don't overwrite all 
> partitions. If it is set to static instead, the data is truncated and 
> inserted.
>  _spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')_
>  _df.write.mode('overwrite').format('parquet').insertInto(.)_
> However, if the above statement is changed to 
> _df.write.mode('overwrite').format('parquet').insertInto(.,overwrite=True)_
>  it starts behaving correctly, i.e. it overwrites the data in the specific 
> partition. 
> It seems that although the save mode has been specified earlier, precedence 
> is given to the parameter set in the insertInto method call:  
> +_*insertInto(.,overwrite=True)*_+  
> This is happening in PySpark.
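For reference, a self-contained Scala sketch of the configuration under discussion, with placeholder table and values; note that the `overwrite=` argument mentioned above exists only in the PySpark insertInto API, while in Scala the behaviour comes from the save mode.

{code:scala}
import spark.implicits._   // assumes a SparkSession named `spark`, e.g. spark-shell

// Dynamic mode: only the partitions present in the incoming DataFrame are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

spark.sql(
  "CREATE TABLE IF NOT EXISTS tgt (value STRING, dt STRING) USING parquet PARTITIONED BY (dt)")

val df = Seq(("a", "2020-04-01"), ("b", "2020-04-02")).toDF("value", "dt")

df.write.mode("overwrite").insertInto("tgt")
{code}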



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31373) Cluster tried to fetch blocks from blacklisted node of previous stage

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31373.
--
Resolution: Invalid

Let's ask questions on the mailing list rather than filing an issue (see also 
https://spark.apache.org/community.html).

> Cluster tried to fetch blocks from blacklisted node of previous stage
> -
>
> Key: SPARK-31373
> URL: https://issues.apache.org/jira/browse/SPARK-31373
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager
>Affects Versions: 2.4.2
> Environment: EMR cluster with r5.4xlarge and r5.8xlarge instances
>Reporter: Yuchen Feng
>Priority: Major
>
> We enabled blacklisting on our Spark application, but recently we saw a weird 
> issue.
> Our code looks like
>   {{rdd.repartitions(...).mapPartitions(...).groupByKey(...).map().collect()}}
> In the mapPartitions stage, some executors hit the exception "Can't connect 
> to host xx: Connection reset by peer" and the tasks on them failed, so all 
> executors under that node were blacklisted, as well as the node itself. These 
> executors did complete some tasks before being blacklisted.
> Then in the next stage (groupByKey(...).map()), the application failed with a 
> block fetch failure (IndexOutOfBound exception) when some healthy executor 
> tried to fetch a block from one of the above blacklisted executors.
> It happened multiple times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31367) add octet_length to functions

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31367.
--
Resolution: Incomplete

I am leaving this resolved for now due to no feedback from the reporter.

> add octet_length to functions
> -
>
> Key: SPARK-31367
> URL: https://issues.apache.org/jira/browse/SPARK-31367
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Bill Schneider
>Priority: Major
>
> Many functions are available in statically-typed Scala code via 
> org.apache.spark.sql.functions, for example `length`: 
> [https://github.com/apache/spark/blob/f5250a581b765ef8c9044438b5d27461f843b7fd/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2407]
> I'm requesting `octet_length` to be included similarly. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31430) Bug in the approximate quantile computation.

2020-04-12 Thread Siddartha Naidu (Jira)
Siddartha Naidu created SPARK-31430:
---

 Summary: Bug in the approximate quantile computation.
 Key: SPARK-31430
 URL: https://issues.apache.org/jira/browse/SPARK-31430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Siddartha Naidu


I am seeing a bug where passing a lower relative error to the {{approxQuantile}} 
function leads to incorrect results when the data has multiple partitions. 
Setting a relative error of 1e-6 causes it to compute equal values for the 0.9 
and 1.0 quantiles. Coalescing back to 1 partition gives correct results. This 
issue was not present in Spark 2.4.5; we noticed it when testing 
3.0.0-preview.

{{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', header=True, 
schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
{{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
{{[1422576000.0, 1430352000.0, 1438300800.0]}}
{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
{{[1422576000.0, 1430524800.0, 1438300800.0]}}
{color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
0.01)}}{color}
{color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
{{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
{{[1422576000.0, 1430524800.0, 1438300800.0]}}
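Since the CSV from the report is not attached, a synthetic-data sketch along the same lines (Scala; whether it actually reproduces the bug has not been verified here): generate a numeric column spread over many partitions and compare quantiles at different relative errors.

{code:scala}
// 10M monotonically increasing values spread over 200 partitions.
val df = spark.range(0, 10000000).repartition(200).toDF("seconds")
val probs = Array(0.8, 0.9, 1.0)

println(df.stat.approxQuantile("seconds", probs, 0.0001).mkString(", "))
println(df.stat.approxQuantile("seconds", probs, 0.000001).mkString(", "))

// Single-partition result, which the reporter uses as the ground truth.
println(df.coalesce(1).stat.approxQuantile("seconds", probs, 0.000001).mkString(", "))
{code}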



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31403) TreeNode asCode function incorrectly handles null literals

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31403.
--
Resolution: Cannot Reproduce

Seems I can't reproduce from the master. Probably fixed somewhere else:

{code}
import org.apache.spark.sql.catalyst.plans.logical.Project

spark.range(10).createOrReplaceTempView("users")

val plan = (spark
  .sql("select if(isnull(id), null, 2) from users")
  .queryExecution
  .optimizedPlan)

println(plan.asInstanceOf[Project].projectList.head.asCode) 
{code}
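
For reference, the explicit null case suggested in the description below would 
look roughly like this standalone sketch (illustrative only, not the actual 
TreeNode code):

{code:scala}
// A catch-all case that calls .toString throws an NPE on null,
// so a dedicated null case is matched first.
def renderArg(arg: Any): String = arg match {
  case null      => "null"
  case s: String => "\"" + s + "\""
  case other     => other.toString
}

// renderArg(null) returns "null" instead of throwing a NullPointerException.
{code}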

> TreeNode asCode function incorrectly handles null literals
> --
>
> Key: SPARK-31403
> URL: https://issues.apache.org/jira/browse/SPARK-31403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Carl Sverre
>Priority: Minor
>
> In the TreeNode code in Catalyst the asCode function incorrectly handles null 
> literals.  When it tries to render a null literal it will match {{null}} 
> using the third case expression and try to call {{null.toString}} which will 
> raise a NullPointerException.
> I verified this bug exists in Spark 2.4.4 and the same code appears to be in 
> master:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L707]
> The fix seems trivial - add an explicit case for null.
> One way to reproduce this is via:
> {code:java}
>   val plan =
> spark
>   .sql("select if(isnull(id), null, 2) from testdb_jdbc.users")
>   .queryExecution
>   .optimizedPlan
>   println(plan.asInstanceOf[Project].projectList.head.asCode) {code}
> However any other way which generates a Literal with the value null will 
> cause the issue.
> In this case the above SparkSQL will generate the literal: {{Literal(null, 
> IntegerType)}} for the "trueValue" of the if statement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31403) TreeNode asCode function incorrectly handles null literals

2020-04-12 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31403:
-
Component/s: (was: Spark Core)
 SQL

> TreeNode asCode function incorrectly handles null literals
> --
>
> Key: SPARK-31403
> URL: https://issues.apache.org/jira/browse/SPARK-31403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Carl Sverre
>Priority: Minor
>
> In the TreeNode code in Catalyst the asCode function incorrectly handles null 
> literals.  When it tries to render a null literal it will match {{null}} 
> using the third case expression and try to call {{null.toString}} which will 
> raise a NullPointerException.
> I verified this bug exists in Spark 2.4.4 and the same code appears to be in 
> master:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L707]
> The fix seems trivial - add an explicit case for null.
> One way to reproduce this is via:
> {code:java}
>   val plan =
> spark
>   .sql("select if(isnull(id), null, 2) from testdb_jdbc.users")
>   .queryExecution
>   .optimizedPlan
>   println(plan.asInstanceOf[Project].projectList.head.asCode) {code}
> However any other way which generates a Literal with the value null will 
> cause the issue.
> In this case the above SparkSQL will generate the literal: {{Literal(null, 
> IntegerType)}} for the "trueValue" of the if statement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082064#comment-17082064
 ] 

Hyukjin Kwon commented on SPARK-31429:
--

[~huaxingao] can you also related JIRA links?

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> function into finer categories such as ranking functions and analytic 
> functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082064#comment-17082064
 ] 

Hyukjin Kwon edited comment on SPARK-31429 at 4/13/20, 5:08 AM:


[~huaxingao] can you also add related JIRA links?


was (Author: hyukjin.kwon):
[~huaxingao] can you also related JIRA links?

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> function into finer categories such as ranking functions and analytic 
> functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082068#comment-17082068
 ] 

Huaxin Gao commented on SPARK-31429:


related Jira
https://issues.apache.org/jira/browse/SPARK-31390
https://issues.apache.org/jira/browse/SPARK-31349

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> function into finer categories such as ranking functions and analytic 
> functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31430) Bug in the approximate quantile computation.

2020-04-12 Thread Siddartha Naidu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddartha Naidu updated SPARK-31430:

Attachment: approx_quantile_data.csv

> Bug in the approximate quantile computation.
> 
>
> Key: SPARK-31430
> URL: https://issues.apache.org/jira/browse/SPARK-31430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Siddartha Naidu
>Priority: Major
> Attachments: approx_quantile_data.csv
>
>
> I am seeing a bug where passing a lower relative error to the 
> {{approxQuantile}} function leads to incorrect results when the data has 
> multiple partitions. Setting a relative error of 1e-6 causes it to compute 
> equal values for the 0.9 and 1.0 quantiles. Coalescing back to 1 partition 
> gives correct results. This issue was not present in Spark 2.4.5; we noticed 
> it when testing 3.0.0-preview.
> {{>>> df = spark.read.csv('file:///tmp/approx_quantile_data.csv', 
> header=True, 
> schema=T.StructType([T.StructField('Store',T.StringType(),True),T.StructField('seconds',T.LongType(),True)]))}}
> {{>>> df = df.repartition(200, 'Store').localCheckpoint()}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.0001)}}
> {{[1422576000.0, 1430352000.0, 1438300800.0]}}
> {{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 0.1)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}
> {color:#de350b}{{>>> df.approxQuantile('seconds', [0.8, 0.9, 1.0], 
> 0.01)}}{color}
> {color:#de350b}{{[1422576000.0, 1438300800.0, 1438300800.0]}}{color}
> {{>>> df.coalesce(1).approxQuantile('seconds', [0.8, 0.9, 1.0], 0.01)}}
> {{[1422576000.0, 1430524800.0, 1438300800.0]}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31398) Speed up reading dates in ORC

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31398:
---

Assignee: Maxim Gekk

> Speed up reading dates in ORC
> -
>
> Key: SPARK-31398
> URL: https://issues.apache.org/jira/browse/SPARK-31398
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, the ORC datasource converts values of DATE type to java.sql.Date and 
> then converts the result to days since the epoch in the Proleptic Gregorian 
> calendar. The ORC datasource does this conversion when 
> spark.sql.orc.enableVectorizedReader is set to false.
> The conversion to java.sql.Date is not necessary because we can use 
> DaysWritable, which performs the rebasing in a much more optimal way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31398) Speed up reading dates in ORC

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31398.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28169
[https://github.com/apache/spark/pull/28169]

> Speed up reading dates in ORC
> -
>
> Key: SPARK-31398
> URL: https://issues.apache.org/jira/browse/SPARK-31398
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the ORC datasource converts values of DATE type to java.sql.Date and 
> then converts the result to days since the epoch in the Proleptic Gregorian 
> calendar. The ORC datasource does this conversion when 
> spark.sql.orc.enableVectorizedReader is set to false.
> The conversion to java.sql.Date is not necessary because we can use 
> DaysWritable, which performs the rebasing in a much more optimal way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31429:
-
Description: 
Add additional fields in ExpressionDescription so we can have more granular 
category in function documentation. For example, we want to group window 
function into finer categories such as ranking functions and analytic functions.

See a link below for related discussions;
https://github.com/apache/spark/pull/28170#issuecomment-611917191

  was:Add additional fields in ExpressionDescription so we can have more 
granular category in function documentation. For example, we want to group 
window function into finer categories such as ranking functions and analytic 
functions.


> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> function into finer categories such as ranking functions and analytic 
> functions.
> See a link below for related discussions;
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31429:
-
Description: 
Add additional fields in ExpressionDescription so we can have more granular 
category in function documentation. For example, we want to group window 
function into finer categories such as ranking functions and analytic functions.

See Hyukjin's comment below for more details;
https://github.com/apache/spark/pull/28170#issuecomment-611917191

  was:
Add additional fields in ExpressionDescription so we can have more granular 
category in function documentation. For example, we want to group window 
function into finer categories such as ranking functions and analytic functions.

See a link below for related discussions;
https://github.com/apache/spark/pull/28170#issuecomment-611917191


> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> function into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details;
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-12 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17082077#comment-17082077
 ] 

Takeshi Yamamuro commented on SPARK-31429:
--

Thanks for filing, Huaxin. I added some links.

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> function into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details;
> https://github.com/apache/spark/pull/28170#issuecomment-611917191
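
A self-contained sketch of the idea (this is not the real ExpressionDescription 
annotation; the {{group}} field name is only illustrative):

{code:scala}
import scala.annotation.StaticAnnotation

// Hypothetical annotation carrying a finer-grained category that a
// documentation generator could group functions by.
class exprDoc(usage: String, group: String) extends StaticAnnotation

@exprDoc(
  usage = "_FUNC_() - ranking window function example",
  group = "window_funcs.ranking") // e.g. ranking vs. analytic functions
class RankLikeExample
{code}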



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31431) CalendarInterval encoder support

2020-04-12 Thread Kent Yao (Jira)
Kent Yao created SPARK-31431:


 Summary: CalendarInterval encoder support
 Key: SPARK-31431
 URL: https://issues.apache.org/jira/browse/SPARK-31431
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Kent Yao


CalendarInterval can be converted to/from the internal Spark SQL 
representation when it is a member of a Scala product type (e.g. tuples or case 
classes), but not as a primitive type.
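
A rough spark-shell sketch of the asymmetry described above, taken on trust 
from the report (the three-argument constructor is the Spark 3.x one):

{code:scala}
import org.apache.spark.unsafe.types.CalendarInterval
import spark.implicits._

// Reported to work: CalendarInterval as a field of a product type.
val nested = Seq((1, new CalendarInterval(0, 1, 0L))).toDS()

// Reported to fail (no encoder found) when it is the top-level type:
// val topLevel = Seq(new CalendarInterval(0, 1, 0L)).toDS()
{code}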



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-18886:
---

Assignee: Nicholas Brett Marcott

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Nicholas Brett Marcott
>Priority: Major
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".  *NOTE*: due to SPARK-18967, avoid 
> setting the {{spark.locality.wait=0}} -- instead, use 
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}
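
A minimal sketch of workaround (1) above as SparkConf settings; the wait values 
are examples only and should be tuned to your task durations:

{code:scala}
import org.apache.spark.SparkConf

// Shorten per-level locality waits so they sit below typical task duration.
// Per SPARK-18967, do not set spark.locality.wait to 0; use e.g. 1ms instead.
val conf = new SparkConf()
  .set("spark.locality.wait.process", "1s")
  .set("spark.locality.wait.node", "0")
  .set("spark.locality.wait.rack", "0")
{code}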



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18886.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Assignee: Nicholas Brett Marcott
>Priority: Major
> Fix For: 3.1.0
>
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling wait.
> Instead of having *one* delay to wait for resources with better locality, 
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".  *NOTE*: due to SPARK-18967, avoid 
> setting the {{spark.locality.wait=0}} -- instead, use 
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect -- with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> control the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31407) Fix hive/SQLQuerySuite.derived from Hive query file: drop_database_removes_partition_dirs.q

2020-04-12 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31407:
-
Description: Test "derived from Hive query file: 
drop_database_removes_partition_dirs.q" can fail if we run it separately but 
can succeed when running the whole hive/SQLQuerySuite.  (was: Test "derived from 
Hive query file: drop_database_removes_partition_dirs.q" can fail if we run it 
separately rather than running the whole hive/SQLQuerySuite.)

> Fix hive/SQLQuerySuite.derived from Hive query file: 
> drop_database_removes_partition_dirs.q
> ---
>
> Key: SPARK-31407
> URL: https://issues.apache.org/jira/browse/SPARK-31407
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: wuyi
>Priority: Major
>
> Test "derived from Hive query file: drop_database_removes_partition_dirs.q" 
> can fail if we run it separately but can succeed when running the whole 
> hive/SQLQuerySuite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31407) Fix hive/SQLQuerySuite.derived from Hive query file: drop_database_removes_partition_dirs.q

2020-04-12 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31407:
-
Description: Test "derived from Hive query file: 
drop_database_removes_partition_dirs.q" can fail if we run it separately but 
can succeed when running with the whole hive/SQLQuerySuite.  (was: Test "derived 
from Hive query file: drop_database_removes_partition_dirs.q" can fail if we 
run it separately but can succeed when running the whole hive/SQLQuerySuite.)

> Fix hive/SQLQuerySuite.derived from Hive query file: 
> drop_database_removes_partition_dirs.q
> ---
>
> Key: SPARK-31407
> URL: https://issues.apache.org/jira/browse/SPARK-31407
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: wuyi
>Priority: Major
>
> Test "derived from Hive query file: drop_database_removes_partition_dirs.q" 
> can fail if we run it separately but can succeed when running with the whole 
> hive/SQLQuerySuite.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31402) Incorrect rebasing of BCE dates

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31402.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28172
[https://github.com/apache/spark/pull/28172]

> Incorrect rebasing of BCE dates
> ---
>
> Key: SPARK-31402
> URL: https://issues.apache.org/jira/browse/SPARK-31402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Dates before the common era are rebased incorrectly, see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> postgreSQL/date.sql
> Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for 
> query #93
> select make_date(-44, 3, 15)
> {code}
> Even though such dates are outside the valid range of dates supported by the DATE 
> type, there is a test in postgreSQL/date.sql for a negative year, so it 
> would be nice to fix the issue. 
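
A one-line reproduction, as reported (on affected 3.0.0 builds the result was 
0045-03-15 instead of the expected -0044-03-15):

{code:scala}
spark.sql("SELECT make_date(-44, 3, 15) AS d").show()
{code}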



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31402) Incorrect rebasing of BCE dates

2020-04-12 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31402:
---

Assignee: Maxim Gekk

> Incorrect rebasing of BCE dates
> ---
>
> Key: SPARK-31402
> URL: https://issues.apache.org/jira/browse/SPARK-31402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Dates before the common era are rebased incorrectly, see 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120679/testReport/org.apache.spark.sql/SQLQueryTestSuite/sql/
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> postgreSQL/date.sql
> Expected "[-0044]-03-15", but got "[0045]-03-15" Result did not match for 
> query #93
> select make_date(-44, 3, 15)
> {code}
> Even though such dates are outside the valid range of dates supported by the DATE 
> type, there is a test in postgreSQL/date.sql for a negative year, so it 
> would be nice to fix the issue. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org