[jira] [Updated] (SPARK-28429) SQL Datetime util function being casted to double instead of timestamp

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28429:

Component/s: (was: Tests)

> SQL Datetime util function being casted to double instead of timestamp
> --
>
> Key: SPARK-28429
> URL: https://issues.apache.org/jira/browse/SPARK-28429
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> In the code below, the string '100 days' in now()+'100 days' is cast to double and then an
> error is thrown:
> {code:sql}
> CREATE TEMP VIEW v_window AS
> SELECT i, min(i) over (order by i range between '1 day' preceding and '10 days' following) as min_i
> FROM range(now(), now()+'100 days', '1 hour') i;
> {code}
> Error:
> {code:sql}
> cannot resolve '(current_timestamp() + CAST('100 days' AS DOUBLE))' due to
> data type mismatch: differing types in '(current_timestamp() + CAST('100
> days' AS DOUBLE))' (timestamp and double).;{code}






[jira] [Assigned] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28411:


Assignee: Huaxin Gao

> insertInto with overwrite inconsistent behaviour Python/Scala
> -
>
> Key: SPARK-28411
> URL: https://issues.apache.org/jira/browse/SPARK-28411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.1, 2.4.0
>Reporter: Maria Rebelka
>Assignee: Huaxin Gao
>Priority: Minor
>
> The df.write.mode("overwrite").insertInto("table") call behaves inconsistently
> between Scala and Python. In Python, insertInto ignores the "mode" parameter and
> appends by default. Only when changing the syntax to df.write.insertInto("table",
> overwrite=True) do we get the expected behaviour.
> This is native Spark syntax and is expected to behave the same across languages.
> Also, in other write methods, like saveAsTable or write.parquet, "mode" seems
> to be respected.
> Reproduce, Python, ignore "overwrite":
> {code:java}
> df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j'])
> # create the table and load data
> df.write.saveAsTable("spark_overwrite_issue")
> # insert overwrite, expected result - 2 rows
> df.write.mode("overwrite").insertInto("spark_overwrite_issue")
> spark.sql("select * from spark_overwrite_issue").count()
> # result - 4 rows, insert appended data instead of overwrite{code}
> Reproduce, Scala, works as expected:
> {code:java}
> val df = Seq((1, 2),(3,4)).toDF("i","j")
> df.write.mode("overwrite").insertInto("spark_overwrite_issue")
> spark.sql("select * from spark_overwrite_issue").count()
> // result - 2 rows{code}
> Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks)
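For reference, a short PySpark sketch of the keyword-argument workaround described above, assuming an active SparkSession named spark and reusing the table name from the reproduction:

{code:python}
df = spark.createDataFrame([(1, 2), (3, 4)], ['i', 'j'])

# create the table and load data (assumes the table does not exist yet)
df.write.saveAsTable("spark_overwrite_issue")

# passing overwrite explicitly to insertInto is the workaround reported above;
# afterwards the table holds 2 rows as expected
df.write.insertInto("spark_overwrite_issue", overwrite=True)
spark.sql("select * from spark_overwrite_issue").count()
{code}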






[jira] [Resolved] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28411.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25175
[https://github.com/apache/spark/pull/25175]

> insertInto with overwrite inconsistent behaviour Python/Scala
> -
>
> Key: SPARK-28411
> URL: https://issues.apache.org/jira/browse/SPARK-28411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.1, 2.4.0
>Reporter: Maria Rebelka
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> The df.write.mode("overwrite").insertInto("table") call behaves inconsistently
> between Scala and Python. In Python, insertInto ignores the "mode" parameter and
> appends by default. Only when changing the syntax to df.write.insertInto("table",
> overwrite=True) do we get the expected behaviour.
> This is native Spark syntax and is expected to behave the same across languages.
> Also, in other write methods, like saveAsTable or write.parquet, "mode" seems
> to be respected.
> Reproduce, Python, ignore "overwrite":
> {code:java}
> df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j'])
> # create the table and load data
> df.write.saveAsTable("spark_overwrite_issue")
> # insert overwrite, expected result - 2 rows
> df.write.mode("overwrite").insertInto("spark_overwrite_issue")
> spark.sql("select * from spark_overwrite_issue").count()
> # result - 4 rows, insert appended data instead of overwrite{code}
> Reproduce, Scala, works as expected:
> {code:java}
> val df = Seq((1, 2),(3,4)).toDF("i","j")
> df.write.mode("overwrite").insertInto("spark_overwrite_issue")
> spark.sql("select * from spark_overwrite_issue").count()
> // result - 2 rows{code}
> Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks)






[jira] [Resolved] (SPARK-27609) from_json expects values of options dictionary to be

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27609.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25182
[https://github.com/apache/spark/pull/25182]

> from_json expects values of options dictionary to be 
> -
>
> Key: SPARK-27609
> URL: https://issues.apache.org/jira/browse/SPARK-27609
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: I've found this issue on an AWS Glue development 
> endpoint which is running Spark 2.2.1 and being given jobs through a 
> SparkMagic Python 2 kernel, running through livy and all that. I don't know 
> how much of that is important for reproduction, and can get more details if 
> needed. 
>Reporter: Zachary Jablons
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> When reading a column of a DataFrame that consists of serialized JSON, one of 
> the options for inferring the schema and then parsing the JSON is to do a two 
> step process consisting of:
>  
> {code}
> # this results in a new dataframe where the top-level keys of the JSON
> # are columns
> df_parsed_direct = spark.read.json(df.rdd.map(lambda row: row.json_col))
> # this does that while preserving the rest of df
> schema = df_parsed_direct.schema
> df_parsed = df.withColumn('parsed', from_json(df.json_col, schema))
> {code}
> When I do this, I sometimes find myself passing in options. My understanding 
> is, from the documentation 
> [here|http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json],
>  that the nature of these options passed should be the same whether I do
> {code}
> spark.read.option('option',value)
> {code}
> or
> {code}
> from_json(df.json_col, schema, options={'option':value})
> {code}
>  
> However, I've found that the latter expects value to be a string 
> representation of the value that can be decoded by JSON. So, for example 
> options=\{'multiLine':True} fails with 
> {code}
> java.lang.ClassCastException: java.lang.Boolean cannot be cast to 
> java.lang.String
> {code}
> whereas {{options=\{'multiLine':'true'}}} works just fine. 
> Notably, providing {{spark.read.option('multiLine',True)}} works fine!
> The code for reproducing this issue as well as the stacktrace from hitting it 
> are provided in [this 
> gist|https://gist.github.com/zmjjmz/0af5cf9b059b4969951e825565e266aa]. 
> I also noticed that from_json doesn't complain if you give it a garbage 
> option key – but that seems separate.
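A small PySpark sketch of the behaviour described above, with a string-valued option; the column name and schema are illustrative, and an active SparkSession named spark is assumed:

{code:python}
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([StructField("a", LongType())])
df = spark.createDataFrame([('{"a": 1}',)], ["json_col"])

# string-valued option: accepted, since the options ultimately feed a Map[String, String]
df.select(from_json(df.json_col, schema, {"multiLine": "true"})).show()

# boolean-valued option: reported above to fail with
#   java.lang.ClassCastException: java.lang.Boolean cannot be cast to java.lang.String
# df.select(from_json(df.json_col, schema, {"multiLine": True})).show()
{code}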






[jira] [Assigned] (SPARK-27609) from_json expects values of options dictionary to be

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27609:


Assignee: Maxim Gekk

> from_json expects values of options dictionary to be 
> -
>
> Key: SPARK-27609
> URL: https://issues.apache.org/jira/browse/SPARK-27609
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: I've found this issue on an AWS Glue development 
> endpoint which is running Spark 2.2.1 and being given jobs through a 
> SparkMagic Python 2 kernel, running through livy and all that. I don't know 
> how much of that is important for reproduction, and can get more details if 
> needed. 
>Reporter: Zachary Jablons
>Assignee: Maxim Gekk
>Priority: Minor
>
> When reading a column of a DataFrame that consists of serialized JSON, one of 
> the options for inferring the schema and then parsing the JSON is to do a two 
> step process consisting of:
>  
> {code}
> # this results in a new dataframe where the top-level keys of the JSON
> # are columns
> df_parsed_direct = spark.read.json(df.rdd.map(lambda row: row.json_col))
> # this does that while preserving the rest of df
> schema = df_parsed_direct.schema
> df_parsed = df.withColumn('parsed', from_json(df.json_col, schema))
> {code}
> When I do this, I sometimes find myself passing in options. My understanding 
> is, from the documentation 
> [here|http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json],
>  that the nature of these options passed should be the same whether I do
> {code}
> spark.read.option('option',value)
> {code}
> or
> {code}
> from_json(df.json_col, schema, options={'option':value})
> {code}
>  
> However, I've found that the latter expects value to be a string 
> representation of the value that can be decoded by JSON. So, for example 
> options=\{'multiLine':True} fails with 
> {code}
> java.lang.ClassCastException: java.lang.Boolean cannot be cast to 
> java.lang.String
> {code}
> whereas {{options=\{'multiLine':'true'}}} works just fine. 
> Notably, providing {{spark.read.option('multiLine',True)}} works fine!
> The code for reproducing this issue as well as the stacktrace from hitting it 
> are provided in [this 
> gist|https://gist.github.com/zmjjmz/0af5cf9b059b4969951e825565e266aa]. 
> I also noticed that from_json doesn't complain if you give it a garbage 
> option key – but that seems separate.






[jira] [Updated] (SPARK-28434) Decision Tree model isn't equal after save and load

2019-07-17 Thread Ievgen Prokhorenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ievgen Prokhorenko updated SPARK-28434:
---
Description: 
The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` has a TODO at line 628 saying:

 
{code:java}
// TODO: Check other fields besides the information gain.
{code}
If, in addition to the existing check of the InformationGainStats gain value, I add
another check (for instance, impurity), the test fails because the values differ
between the saved model and the one restored from disk.

 

See PR with an example.

 

The tests are executed with this command:

 
{code:java}
build/mvn -e -Dtest=none 
-DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code}
 

Excerpts from the output of the command above:


{code:java}
...

- model save/load *** FAILED ***
checkEqual failed since the two trees were not identical.
TREE A:
DecisionTreeModel classifier of depth 2 with 5 nodes
If (feature 0 <= 0.5)
Predict: 0.0
Else (feature 0 > 0.5)
If (feature 1 in {0.0,1.0})
Predict: 0.0
Else (feature 1 not in {0.0,1.0})
Predict: 0.0

TREE B:
DecisionTreeModel classifier of depth 2 with 5 nodes
If (feature 0 <= 0.5)
Predict: 0.0
Else (feature 0 > 0.5)
If (feature 1 in {0.0,1.0})
Predict: 0.0
Else (feature 1 not in {0.0,1.0})
Predict: 0.0 (DecisionTreeSuite.scala:610)

...{code}
If I add a little debug info in the `DecisionTreeSuite.checkEqual`:

 
{code:java}
val aStats = a.stats
val bStats = b.stats


println(s"id ${a.id} ${b.id}")
println(s"impurity ${aStats.get.impurity} ${bStats.get.impurity}")
println(s"leftImpurity ${aStats.get.leftImpurity} ${bStats.get.leftImpurity}")
println(s"rightImpurity ${aStats.get.rightImpurity} 
${bStats.get.rightImpurity}")
println(s"leftPredict ${aStats.get.leftPredict} ${bStats.get.leftPredict}")
println(s"rightPredict ${aStats.get.rightPredict} ${bStats.get.rightPredict}")
println(s"gain ${aStats.get.gain} ${bStats.get.gain}")
{code}
 

Then, in the output of the test command we can see that only values of `gain` 
are equal:

 
{code:java}
id 1 1
impurity 0.2 0.5
leftImpurity 0.3 0.5
rightImpurity 0.4 0.5
leftPredict 1.0 (prob = 0.4) 0.0 (prob = 1.0)
rightPredict 0.0 (prob = 0.6) 0.0 (prob = 1.0)
gain 0.1 0.1
{code}

  was:
The file 
`mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` on 
the line no. 628 has a TODO saying:

 
{code:java}
// TODO: Check other fields besides the information gain.
{code}
If, in addition to the existing check of InformationGainStats' gain value I add 
another check, for instance, impurity – the test fails because the values are 
different in the saved model and the one restored from disk.

 

See PR with an example.

 

The tests are executed with this command:
{code:java}
build/mvn -e -Dtest=none 
-DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code}
 


> Decision Tree model isn't equal after save and load
> ---
>
> Key: SPARK-28434
> URL: https://issues.apache.org/jira/browse/SPARK-28434
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
> Environment: spark from master
>Reporter: Ievgen Prokhorenko
>Priority: Major
>
> The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` has a TODO at line 628 saying:
>  
> {code:java}
> // TODO: Check other fields besides the information gain.
> {code}
> If, in addition to the existing check of the InformationGainStats gain value, I
> add another check (for instance, impurity), the test fails because the values
> differ between the saved model and the one restored from disk.
>  
> See PR with an example.
>  
> The tests are executed with this command:
>  
> {code:java}
> build/mvn -e -Dtest=none 
> -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code}
>  
> Excerpts from the output of the command above:
> {code:java}
> ...
> - model save/load *** FAILED ***
> checkEqual failed since the two trees were not identical.
> TREE A:
> DecisionTreeModel classifier of depth 2 with 5 nodes
> If (feature 0 <= 0.5)
> Predict: 0.0
> Else (feature 0 > 0.5)
> If (feature 1 in {0.0,1.0})
> Predict: 0.0
> Else (feature 1 not in {0.0,1.0})
> Predict: 0.0
> TREE B:
> DecisionTreeModel classifier of depth 2 with 5 nodes
> If (feature 0 <= 0.5)
> Predict: 0.0
> Else (feature 0 > 0.5)
> If (feature 1 in {0.0,1.0})
> Predict: 0.0
> Else (feature 1 not in {0.0,1.0})
> Predict: 0.0 (DecisionTreeSuite.scala:610)
> ...{code}
> If I add a little debug info in the `DecisionTreeSuite.checkEqual`:
>  
> {code:java}
> val aStats = a.stats
> val bStats = b.stats
> println(s"id ${a.id} ${b.id}")
> println(s"impurity ${aStats.get.impurity} ${bStats.get.impurity}")
> println(s"leftImpurity 

[jira] [Created] (SPARK-28434) Decision Tree model isn't equal after save and load

2019-07-17 Thread Ievgen Prokhorenko (JIRA)
Ievgen Prokhorenko created SPARK-28434:
--

 Summary: Decision Tree model isn't equal after save and load
 Key: SPARK-28434
 URL: https://issues.apache.org/jira/browse/SPARK-28434
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.4.3
 Environment: spark from master
Reporter: Ievgen Prokhorenko


The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` has a TODO at line 628 saying:

 
{code:java}
// TODO: Check other fields besides the information gain.
{code}
If, in addition to the existing check of the InformationGainStats gain value, I add
another check (for instance, impurity), the test fails because the values differ
between the saved model and the one restored from disk.

 

See PR with an example.

 

The tests are executed with this command:
{code:java}
build/mvn -e -Dtest=none 
-DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code}
 






[jira] [Created] (SPARK-28433) Incorrect assertion in scala test for aarch64 platform

2019-07-17 Thread huangtianhua (JIRA)
huangtianhua created SPARK-28433:


 Summary: Incorrect assertion in scala test for aarch64 platform
 Key: SPARK-28433
 URL: https://issues.apache.org/jira/browse/SPARK-28433
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3, 3.0.0
Reporter: huangtianhua


We ran the Spark unit tests on an aarch64 server, and two SQL Scala tests failed:
- SPARK-26021: NaN and -0.0 in grouping expressions *** FAILED ***
   2143289344 equaled 2143289344 (DataFrameAggregateSuite.scala:732)
 - NaN and -0.0 in window partition keys *** FAILED ***
   2143289344 equaled 2143289344 (DataFrameWindowFunctionsSuite.scala:704)

We found that the values of floatToRawIntBits(0.0f / 0.0f) and
floatToRawIntBits(Float.NaN) on aarch64 are the same (2143289344). At first we thought
it was something about the JDK or Scala, but after discussing with the jdk-dev and Scala
communities (see
https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845
), we believe the value depends on the architecture.








[jira] [Commented] (SPARK-10816) EventTime based sessionization

2019-07-17 Thread Chang chen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887561#comment-16887561
 ] 

Chang chen commented on SPARK-10816:


Hi guys

Any updates on this issue?

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>







[jira] [Issue Comment Deleted] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28293:

Comment: was deleted

(was: I'm working on)

> Implement Spark's own GetTableTypesOperation
> 
>
> Key: SPARK-28293
> URL: https://issues.apache.org/jira/browse/SPARK-28293
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Hive-1.2.1.png, Hive-2.3.5.png
>
>
> Build with Hive 1.2.1:
> !Hive-1.2.1.png!
> Build with Hive 2.3.5:
> !Hive-2.3.5.png!






[jira] [Created] (SPARK-28432) Date/Time Functions: make_date/make_timestamp

2019-07-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28432:
---

 Summary: Date/Time Functions: make_date/make_timestamp
 Key: SPARK-28432
 URL: https://issues.apache.org/jira/browse/SPARK-28432
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{make_date(_year_ int, _month_ int, _day_ int)}}|{{date}}|Create date from year, month and day fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}|
|{{make_timestamp(_year_ int, _month_ int, _day_ int, _hour_ int, _min_ int, _sec_ double precision)}}|{{timestamp}}|Create timestamp from year, month, day, hour, minute and seconds fields|{{make_timestamp(2013, 7, 15, 8, 15, 23.5)}}|{{2013-07-15 08:15:23.5}}|

https://www.postgresql.org/docs/11/functions-datetime.html






[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28293:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28426

> Implement Spark's own GetTableTypesOperation
> 
>
> Key: SPARK-28293
> URL: https://issues.apache.org/jira/browse/SPARK-28293
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: Hive-1.2.1.png, Hive-2.3.5.png
>
>
> Build with Hive 1.2.1:
> !Hive-1.2.1.png!
> Build with Hive 2.3.5:
> !Hive-2.3.5.png!






[jira] [Updated] (SPARK-28167) Show global temporary view in database tool

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28167:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28426

> Show global temporary view in database tool
> ---
>
> Key: SPARK-28167
> URL: https://issues.apache.org/jira/browse/SPARK-28167
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28184:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28426

> Avoid creating new sessions in SparkMetadataOperationSuite
> --
>
> Key: SPARK-28184
> URL: https://issues.apache.org/jira/browse/SPARK-28184
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Created] (SPARK-28431) CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message

2019-07-17 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-28431:
--

 Summary: CSV datasource throw 
com.univocity.parsers.common.TextParsingException with large size message 
 Key: SPARK-28431
 URL: https://issues.apache.org/jira/browse/SPARK-28431
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Weichen Xu


The CSV datasource throws com.univocity.parsers.common.TextParsingException with a very
large message, which makes the log output consume a lot of disk space.

Reproduction code:
{code:java}
val s = "a" * 40 * 100
Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv")

spark.read
  .option("maxCharsPerColumn", 3000)
  .csv("/tmp/bogdan/es4196.csv")
  .count{code}
Because of the maxCharsPerColumn limit of 30M, a TextParsingException is thrown. The
message of this exception actually includes everything parsed so far, in this case 30M chars.

This issue is troublesome when we need to parse CSV files with large columns.

We should truncate the oversized message in the TextParsingException.






[jira] [Comment Edited] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift

2019-07-17 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887498#comment-16887498
 ] 

Josh Rosen edited comment on SPARK-27570 at 7/18/19 12:28 AM:
--

[~ste...@apache.org], I finally got a chance to test your {{fadvise}} 
configuration recommendation and that resolved my issue. *However*, I think 
that there's a typo in your recommendation: this only worked when I used 
{{fs.s3a.experimental.*input*.fadvise}} (the {{.input}} was missing in your 
comment).

*Update*: filed HADOOP-16437 to fix the documentation typo.


was (Author: joshrosen):
[~ste...@apache.org], I finally got a chance to test your {{fadvise}} 
configuration recommendation and that resolved my issue. *However*, I think 
that there's a typo in your recommendation: this only worked when I used 
{{fs.s3a.experimental.*input*.fadvise}} (the {{.input}} was missing in your 
comment).
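For context, a minimal PySpark sketch of applying that setting; the spark.hadoop. prefix passes the option through to the Hadoop configuration, "random" is the value commonly suggested for seek-heavy formats such as Parquet, and the bucket and path are placeholders:

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("parquet-read")
         # S3A fadvise policy, per the comment above
         .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
         .getOrCreate())

df = spark.read.parquet("s3a://some-bucket/engagements/")
{code}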

> java.io.EOFException Reached the end of stream - Reading Parquet from Swift
> ---
>
> Key: SPARK-27570
> URL: https://issues.apache.org/jira/browse/SPARK-27570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Harry Hough
>Priority: Major
>
> I did see issue SPARK-25966 but it seems there are some differences as his 
> problem was resolved after rebuilding the parquet files on write. This is 
> 100% reproducible for me across many different days of data.
> I get exceptions such as "Reached the end of stream with 750477 bytes left to 
> read" during some read operations of parquet files. I am reading these files 
> from Openstack swift using openstack-hadoop 2.7.7 on Spark 2.4.
> The issue seems to happen with the where statement. I have also tried filter,
> combining the statements into one, and the Dataset method with a Column, without
> any luck. Which column is used, or what the actual filter on the where clause is,
> doesn't seem to make a difference to whether the error occurs.
>  
> {code:java}
> val engagementDS = spark
>   .read
>   .parquet(createSwiftAddr("engagements", folder))
>   .where("engtype != 0")
>   .where("engtype != 1000")
>   .groupBy($"accid", $"sessionkey")
>   .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", 
> $"testid")).as("engagements"))
> // Exiting paste mode, now interpreting.
> [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception 
> in task 24.0 in stage 53.0 (TID 688)
> java.io.EOFException: Reached the end of stream with 1323959 bytes left to 
> read
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
> at 
> 

[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results

2019-07-17 Thread Alexander Tronchin-James (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887508#comment-16887508
 ] 

Alexander Tronchin-James commented on SPARK-14543:
--

Did the calling syntax change for this? I'm using 2.4.x and can't find anything 
about .byName on writers in the docs, but maybe I'm just bad at searching the 
docs...

> SQL/Hive insertInto has unexpected results
> --
>
> Key: SPARK-14543
> URL: https://issues.apache.org/jira/browse/SPARK-14543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> *Updated description*
> There should be an option to match input data to output columns by name. The 
> API allows operations on tables, which hide the column resolution problem. 
> It's easy to copy from one table to another without listing the columns, and 
> in the API it is common to work with columns by name rather than by position. 
> I think the API should add a way to match columns by name, which is closer to 
> what users expect. I propose adding something like this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
> *Original description*
> The Hive write path adds a pre-insertion cast (projection) to reconcile 
> incoming data columns with the outgoing table schema. Columns are matched by 
> position and casts are inserted to reconcile the two column schemas.
> When columns aren't correctly aligned, this causes unexpected results. I ran 
> into this by not using a correct {{partitionBy}} call (addressed by 
> SPARK-14459), which caused an error message that an int could not be cast to 
> an array. However, if the columns are vaguely compatible, for example string 
> and float, then no error or warning is produced and data is written to the 
> wrong columns using unexpected casts (string -> bigint -> float).
> A real-world use case that will hit this is when a table definition changes 
> by adding a column in the middle of a table. Spark SQL statements that copied 
> from that table to a destination table will then map the columns differently 
> but insert casts that mask the problem. The last column's data will be 
> dropped without a reliable warning for the user.
> This highlights a few problems:
> * Too many or too few incoming data columns should cause an AnalysisException 
> to be thrown
> * Only "safe" casts should be inserted automatically, like int -> long, using 
> UpCast
> * Pre-insertion casts currently ignore extra columns by using zip
> * The pre-insertion cast logic differs between Hive's MetastoreRelation and 
> LogicalRelation
> Also, I think there should be an option to match input data to output columns 
> by name. The API allows operations on tables, which hide the column 
> resolution problem. It's easy to copy from one table to another without 
> listing the columns, and in the API it is common to work with columns by name 
> rather than by position. I think the API should add a way to match columns by 
> name, which is closer to what users expect. I propose adding something like 
> this:
> {code}
> CREATE TABLE src (id: bigint, count: int, total: bigint)
> CREATE TABLE dst (id: bigint, total: bigint, count: int)
> sqlContext.table("src").write.byName.insertInto("dst")
> {code}
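For concreteness, a small PySpark sketch of the positional-matching hazard described above; the schemas follow the proposal, the USING clause is added so the sketch does not depend on Hive support, and an active SparkSession named spark is assumed:

{code:python}
spark.sql("CREATE TABLE src (id bigint, count int, total bigint) USING parquet")
spark.sql("CREATE TABLE dst (id bigint, total bigint, count int) USING parquet")

# insertInto matches columns by position, not name, so src.count silently lands in
# dst.total (and vice versa), with implicit casts inserted instead of an error.
spark.table("src").write.insertInto("dst")
{code}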






[jira] [Updated] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics

2019-07-17 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-28430:
---
Attachment: ui-screenshot.png

> Some stage table rows render wrong number of columns if tasks are missing 
> metrics 
> --
>
> Key: SPARK-28430
> URL: https://issues.apache.org/jira/browse/SPARK-28430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Attachments: ui-screenshot.png
>
>
> The Spark UI's stages table renders too few columns for some tasks if a 
> subset of the tasks are missing their metrics. This is due to an 
> inconsistency in how we render certain columns: some columns gracefully 
> handle this case, but others do not. See attached screenshot below






[jira] [Assigned] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics

2019-07-17 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-28430:
--

Assignee: Josh Rosen

> Some stage table rows render wrong number of columns if tasks are missing 
> metrics 
> --
>
> Key: SPARK-28430
> URL: https://issues.apache.org/jira/browse/SPARK-28430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Attachments: ui-screenshot.png
>
>
> The Spark UI's stages table renders too few columns for some tasks if a 
> subset of the tasks are missing their metrics. This is due to an 
> inconsistency in how we render certain columns: some columns gracefully 
> handle this case, but others do not. See attached screenshot below






[jira] [Updated] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics

2019-07-17 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-28430:
---
Description: 
The Spark UI's stages table renders too few columns for some tasks if a subset 
of the tasks are missing their metrics. This is due to an inconsistency in how 
we render certain columns: some columns gracefully handle this case, but others 
do not. See attached screenshot below

 !ui-screenshot.png! 

  was:The Spark UI's stages table renders too few columns for some tasks if a 
subset of the tasks are missing their metrics. This is due to an inconsistency 
in how we render certain columns: some columns gracefully handle this case, but 
others do not. See attached screenshot below


> Some stage table rows render wrong number of columns if tasks are missing 
> metrics 
> --
>
> Key: SPARK-28430
> URL: https://issues.apache.org/jira/browse/SPARK-28430
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Attachments: ui-screenshot.png
>
>
> The Spark UI's stages table renders too few columns for some tasks if a 
> subset of the tasks are missing their metrics. This is due to an 
> inconsistency in how we render certain columns: some columns gracefully 
> handle this case, but others do not. See attached screenshot below
>  !ui-screenshot.png! 






[jira] [Created] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics

2019-07-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-28430:
--

 Summary: Some stage table rows render wrong number of columns if 
tasks are missing metrics 
 Key: SPARK-28430
 URL: https://issues.apache.org/jira/browse/SPARK-28430
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0, 3.0.0
Reporter: Josh Rosen


The Spark UI's stages table renders too few columns for some tasks if a subset 
of the tasks are missing their metrics. This is due to an inconsistency in how 
we render certain columns: some columns gracefully handle this case, but others 
do not. See attached screenshot below






[jira] [Commented] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift

2019-07-17 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887498#comment-16887498
 ] 

Josh Rosen commented on SPARK-27570:


[~ste...@apache.org], I finally got a chance to test your {{fadvise}} 
configuration recommendation and that resolved my issue. *However*, I think 
that there's a typo in your recommendation: this only worked when I used 
{{fs.s3a.experimental.*input*.fadvise}} (the {{.input}} was missing in your 
comment).

> java.io.EOFException Reached the end of stream - Reading Parquet from Swift
> ---
>
> Key: SPARK-27570
> URL: https://issues.apache.org/jira/browse/SPARK-27570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Harry Hough
>Priority: Major
>
> I did see issue SPARK-25966 but it seems there are some differences as his 
> problem was resolved after rebuilding the parquet files on write. This is 
> 100% reproducible for me across many different days of data.
> I get exceptions such as "Reached the end of stream with 750477 bytes left to 
> read" during some read operations of parquet files. I am reading these files 
> from Openstack swift using openstack-hadoop 2.7.7 on Spark 2.4.
> The issue seems to happen with the where statement. I have also tried filter,
> combining the statements into one, and the Dataset method with a Column, without
> any luck. Which column is used, or what the actual filter on the where clause is,
> doesn't seem to make a difference to whether the error occurs.
>  
> {code:java}
> val engagementDS = spark
>   .read
>   .parquet(createSwiftAddr("engagements", folder))
>   .where("engtype != 0")
>   .where("engtype != 1000")
>   .groupBy($"accid", $"sessionkey")
>   .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", 
> $"testid")).as("engagements"))
> // Exiting paste mode, now interpreting.
> [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception 
> in task 24.0 in stage 53.0 (TID 688)
> java.io.EOFException: Reached the end of stream with 1323959 bytes left to 
> read
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at 

[jira] [Resolved] (SPARK-28417) Spark Submit does not use Proxy User Credentials to Resolve Path for Resources

2019-07-17 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-28417.

Resolution: Duplicate

> Spark Submit does not use Proxy User Credentials to Resolve Path for Resources
> --
>
> Key: SPARK-28417
> URL: https://issues.apache.org/jira/browse/SPARK-28417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 
> 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Abhishek Modi
>Priority: Minor
>
> As of [#SPARK-21012], spark-submit supports wildcard paths (ex:
> {{hdfs:///user/akmodi/*}}). To support these, spark-submit does a glob
> resolution on these paths and overwrites the wildcard paths with the resolved
> paths. This introduced a bug: the change did not use {{proxy-user}}
> credentials when resolving these paths. As a result, Spark 2.2 and later apps
> fail to launch as a {{proxy-user}} if the paths are only readable by
> the {{proxy-user}}.






[jira] [Assigned] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL

2019-07-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28097:
-

Assignee: Seth Fitzsimmons

> Map ByteType to SMALLINT when using JDBC with PostgreSQL
> 
>
> Key: SPARK-28097
> URL: https://issues.apache.org/jira/browse/SPARK-28097
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Seth Fitzsimmons
>Assignee: Seth Fitzsimmons
>Priority: Minor
>
> PostgreSQL doesn't have {{TINYINT}}, which would map directly, but 
> {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when 
> writing).
> This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}.
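A brief PySpark sketch of the manual equivalent mentioned in the last line, assuming a DataFrame df with a ByteType column named byteColumn; the column name, JDBC URL, table name, and credentials are placeholders:

{code:python}
from pyspark.sql.functions import col

# cast the ByteType column to ShortType by hand before writing over JDBC,
# which is what the proposed ByteType -> SMALLINT mapping would do automatically
(df.withColumn("byteColumn", col("byteColumn").cast("short"))
   .write
   .jdbc("jdbc:postgresql://localhost:5432/mydb", "target_table",
         properties={"user": "spark", "password": "secret"}))
{code}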






[jira] [Resolved] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL

2019-07-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28097.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24845
[https://github.com/apache/spark/pull/24845]

> Map ByteType to SMALLINT when using JDBC with PostgreSQL
> 
>
> Key: SPARK-28097
> URL: https://issues.apache.org/jira/browse/SPARK-28097
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Seth Fitzsimmons
>Assignee: Seth Fitzsimmons
>Priority: Minor
> Fix For: 3.0.0
>
>
> PostgreSQL doesn't have {{TINYINT}}, which would map directly, but 
> {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when 
> writing).
> This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}.






[jira] [Commented] (SPARK-18829) Printing to logger

2019-07-17 Thread Alexander Tronchin-James (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887450#comment-16887450
 ] 

Alexander Tronchin-James commented on SPARK-18829:
--

FWIW, the showString method on datasets is private, so it doesn't seem possible 
to call except by internal Dataset methods.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L295
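A possible user-side stopgap (a sketch, not part of this proposal): in PySpark, show() prints to stdout, so its output can be captured and forwarded to a logger without touching Spark internals:

{code:python}
import io
import logging
from contextlib import redirect_stdout

logger = logging.getLogger("dataframe-debug")

def log_show(df, n=20):
    """Capture what df.show(n) would print and send it to the logger instead."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.show(n)
    logger.debug(buf.getvalue())
{code}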

> Printing to logger
> --
>
> Key: SPARK-18829
> URL: https://issues.apache.org/jira/browse/SPARK-18829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: ALL
>Reporter: David Hodeffi
>Priority: Trivial
>  Labels: easyfix, patch
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I would like to print dataframe.show or df.explain(true) output to a log file.
> Right now the code prints to standard output with no way to redirect it.
> It also cannot be configured in log4j.properties.
> My suggestion is to write to both the logger and standard output, i.e.:
> class DataFrame {
>   override def explain(extended: Boolean): Unit = {
>     val explain = ExplainCommand(queryExecution.logical, extended = extended)
>     // scalastyle:off println
>     sqlContext.executePlan(explain).executedPlan.executeCollect().foreach { r =>
>       println(r.getString(0))
>       logger.debug(r.getString(0))
>     }
>     // scalastyle:on println
>   }
>
>   def show(numRows: Int, truncate: Boolean): Unit = {
>     val str = showString(numRows, truncate)
>     println(str)
>     logger.debug(str)
>   }
> }






[jira] [Commented] (SPARK-19256) Hive bucketing support

2019-07-17 Thread Cheng Su (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887313#comment-16887313
 ] 

Cheng Su commented on SPARK-19256:
--

[~aditya.dataengg] - I started the work by submitting a PR 
([https://github.com/apache/spark/pull/23163]) for 
https://issues.apache.org/jira/browse/SPARK-26164, and it is still under 
review. I will ping relevant reviewers to see whether we can speed it up, 
thanks.

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing






[jira] [Created] (SPARK-28429) SQL Datetime util function being casted to double instead of timestamp

2019-07-17 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28429:


 Summary: SQL Datetime util function being casted to double instead 
of timestamp
 Key: SPARK-28429
 URL: https://issues.apache.org/jira/browse/SPARK-28429
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dylan Guedes


In the code below, the string '100 days' in now()+'100 days' is cast to double and then an error is
thrown:
{code:sql}
CREATE TEMP VIEW v_window AS
SELECT i, min(i) over (order by i range between '1 day' preceding and '10 days' following) as min_i
FROM range(now(), now()+'100 days', '1 hour') i;
{code}
Error:

{code:sql}
cannot resolve '(current_timestamp() + CAST('100 days' AS DOUBLE))' due to data
type mismatch: differing types in '(current_timestamp() + CAST('100 days'
AS DOUBLE))' (timestamp and double).;{code}
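For comparison, the addition resolves when the second operand is written as an explicit interval literal rather than a bare string; the PostgreSQL-style implicit string-to-interval cast is what this sub-task concerns. A sketch, assuming an active SparkSession named spark:

{code:python}
# timestamp + interval type-checks, whereas timestamp + string currently
# falls back to a cast to double and fails as shown above
spark.sql("SELECT now() + interval 100 days AS later").show()
{code}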






[jira] [Comment Edited] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-17 Thread YoungGyu Chun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887294#comment-16887294
 ] 

YoungGyu Chun edited comment on SPARK-28288 at 7/17/19 6:11 PM:


hi [~hyukjin.kwon],

After merging SPARK-28359 I still see some errors from query 11 to query 16:
{code:sql}
--- a/sql/core/src/test/resources/sql-tests/results/window.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
@@ -21,10 +21,10 @@ struct<>


 -- !query 1
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT 
ROW) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS 
CURRENT ROW) FROM testData
 ORDER BY cate, val
 -- !query 1 schema
-struct
+struct
 -- !query 1 output
 NULL   NULL0
 3  NULL1
@@ -38,10 +38,10 @@ NULLa   0


 -- !query 2
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT udf(val), cate, sum(val) OVER(PARTITION BY cate ORDER BY val
 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, 
val
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
 NULL   NULL3
 3  NULL3
@@ -55,20 +55,27 @@ NULLa   1


 -- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
-ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY 
val_long
+ROWS BETWEEN CURRENT ROW AND CAST(2147483648 AS int) FOLLOWING) FROM testData 
ORDER BY cate, val_long
 -- !query 3 schema
-struct<>
+struct
 -- !query 3 output
-org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to 
data type mismatch: The data type of the upper bound 'bigint' does not match 
the expected data type 'int'.; line 1 pos 41
+NULL   NULL1
+1  NULL1
+1  a   2147483654
+1  a   2147483653
+2  a   2147483652
+2147483650 a   2147483650
+NULL   b   2147483653
+3  b   2147483653
+2147483650 b   2147483650


 -- !query 4
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 
PRECEDING) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 
PRECEDING) FROM testData
 ORDER BY cate, val
 -- !query 4 schema
-struct
+struct
 -- !query 4 output
 NULL   NULL0
 3  NULL1
@@ -82,10 +89,10 @@ NULLa   0


 -- !query 5
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 5 schema
-struct
+struct
 -- !query 5 output
 NULL   NULLNULL
 3  NULL3
@@ -99,10 +106,10 @@ NULL   a   NULL


 -- !query 6
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY 
val_long
 RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
 -- !query 6 schema
-struct
+struct
 -- !query 6 output
 NULL   NULLNULL
 1  NULL1
@@ -116,10 +123,10 @@ NULL  b   NULL


 -- !query 7
-SELECT val_double, cate, sum(val_double) OVER(PARTITION BY cate ORDER BY 
val_double
+SELECT val_double, udf(cate), sum(val_double) OVER(PARTITION BY cate ORDER BY 
val_double
 RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY cate, 
val_double
 -- !query 7 schema
-struct
+struct
 -- !query 7 output
 NULL   NULLNULL
 1.0NULL1.0
@@ -133,10 +140,10 @@ NULL  NULLNULL


 -- !query 8
-SELECT val_date, cate, max(val_date) OVER(PARTITION BY cate ORDER BY val_date
+SELECT val_date, udf(cate), max(val_date) OVER(PARTITION BY cate ORDER BY 
val_date
 RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) FROM testData ORDER BY cate, 
val_date
 -- !query 8 schema
-struct
+struct
 -- !query 8 output
 NULL   NULLNULL
 2017-08-01 NULL2017-08-01
@@ -150,11 +157,11 @@ NULL  NULLNULL


 -- !query 9
-SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY 
val_timestamp
+SELECT val_timestamp, udf(cate), avg(val_timestamp) OVER(PARTITION BY cate 
ORDER BY val_timestamp
 RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData
 ORDER BY cate, val_timestamp
 -- !query 9 schema
-struct
+struct
 -- !query 9 output
 NULL   NULLNULL
 2017-07-31 17:00:00NULL1.5015456E9
@@ -168,10 +175,10 @@ NULL  NULLNULL


 -- !query 10
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val DESC
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val DESC
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 10 schema
-struct
+struct
 -- !query 10 output
 NULL 

[jira] [Created] (SPARK-28428) Spark `exclude` always expecting `()`

2019-07-17 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-28428:


 Summary: Spark `exclude` always expecting `()` 
 Key: SPARK-28428
 URL: https://issues.apache.org/jira/browse/SPARK-28428
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Dylan Guedes


SparkSQL `exclude` always expects a following call to `()`; PgSQL `exclude`, however, 
does not. Example:

{code:sql}
SELECT sum(unique1) over (rows between 2 preceding and 2 following exclude no 
others),
unique1, four
FROM tenk1 WHERE unique1 < 10;
{code}




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28288) Convert and port 'window.sql' into UDF test base

2019-07-17 Thread YoungGyu Chun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887294#comment-16887294
 ] 

YoungGyu Chun commented on SPARK-28288:
---

Hi [~hyukjin.kwon],

After merging SPARK-28359, I still see some errors from query 11 to query 16:
{code:sql}
--- a/sql/core/src/test/resources/sql-tests/results/window.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
@@ -21,10 +21,10 @@ struct<>


 -- !query 1
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT 
ROW) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS 
CURRENT ROW) FROM testData
 ORDER BY cate, val
 -- !query 1 schema
-struct
+struct
 -- !query 1 output
 NULL   NULL0
 3  NULL1
@@ -38,10 +38,10 @@ NULLa   0


 -- !query 2
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT udf(val), cate, sum(val) OVER(PARTITION BY cate ORDER BY val
 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, 
val
 -- !query 2 schema
-struct
+struct
 -- !query 2 output
 NULL   NULL3
 3  NULL3
@@ -55,20 +55,27 @@ NULLa   1


 -- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
-ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY 
val_long
+ROWS BETWEEN CURRENT ROW AND CAST(2147483648 AS int) FOLLOWING) FROM testData 
ORDER BY cate, val_long
 -- !query 3 schema
-struct<>
+struct
 -- !query 3 output
-org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to 
data type mismatch: The data type of the upper bound 'bigint' does not match 
the expected data type 'int'.; line 1 pos 41
+NULL   NULL1
+1  NULL1
+1  a   2147483654
+1  a   2147483653
+2  a   2147483652
+2147483650 a   2147483650
+NULL   b   2147483653
+3  b   2147483653
+2147483650 b   2147483650


 -- !query 4
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 
PRECEDING) FROM testData
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 
PRECEDING) FROM testData
 ORDER BY cate, val
 -- !query 4 schema
-struct
+struct
 -- !query 4 output
 NULL   NULL0
 3  NULL1
@@ -82,10 +89,10 @@ NULLa   0


 -- !query 5
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 5 schema
-struct
+struct
 -- !query 5 output
 NULL   NULLNULL
 3  NULL3
@@ -99,10 +106,10 @@ NULL   a   NULL


 -- !query 6
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY 
val_long
 RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY 
cate, val_long
 -- !query 6 schema
-struct
+struct
 -- !query 6 output
 NULL   NULLNULL
 1  NULL1
@@ -116,10 +123,10 @@ NULL  b   NULL


 -- !query 7
-SELECT val_double, cate, sum(val_double) OVER(PARTITION BY cate ORDER BY 
val_double
+SELECT val_double, udf(cate), sum(val_double) OVER(PARTITION BY cate ORDER BY 
val_double
 RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY cate, 
val_double
 -- !query 7 schema
-struct
+struct
 -- !query 7 output
 NULL   NULLNULL
 1.0NULL1.0
@@ -133,10 +140,10 @@ NULL  NULLNULL


 -- !query 8
-SELECT val_date, cate, max(val_date) OVER(PARTITION BY cate ORDER BY val_date
+SELECT val_date, udf(cate), max(val_date) OVER(PARTITION BY cate ORDER BY 
val_date
 RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) FROM testData ORDER BY cate, 
val_date
 -- !query 8 schema
-struct
+struct
 -- !query 8 output
 NULL   NULLNULL
 2017-08-01 NULL2017-08-01
@@ -150,11 +157,11 @@ NULL  NULLNULL


 -- !query 9
-SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY 
val_timestamp
+SELECT val_timestamp, udf(cate), avg(val_timestamp) OVER(PARTITION BY cate 
ORDER BY val_timestamp
 RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData
 ORDER BY cate, val_timestamp
 -- !query 9 schema
-struct
+struct
 -- !query 9 output
 NULL   NULLNULL
 2017-07-31 17:00:00NULL1.5015456E9
@@ -168,10 +175,10 @@ NULL  NULLNULL


 -- !query 10
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val DESC
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val DESC
 RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
 -- !query 10 schema
-struct
+struct
 -- !query 10 output
 NULL   NULLNULL
 3  NULL3
@@ -185,57 

[jira] [Created] (SPARK-28427) Support more Postgres JSON functions

2019-07-17 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-28427:
--

 Summary: Support more Postgres JSON functions
 Key: SPARK-28427
 URL: https://issues.apache.org/jira/browse/SPARK-28427
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


Postgres features a number of JSON functions that are missing in Spark: 
https://www.postgresql.org/docs/9.3/functions-json.html

Redshift's JSON functions 
(https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html) have 
partial overlap with the Postgres list.

Some of these functions can be expressed in terms of compositions of existing 
Spark functions. For example, I think that {{json_array_length}} can be 
expressed with {{cardinality}} and {{from_json}}, but there's a caveat related 
to legacy Hive compatibility (see the demo notebook at 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5796212617691211/45530874214710/4901752417050771/latest.html
 for more details).
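
As a rough sketch of that composition in the Scala shell (assuming {{cardinality}} and {{from_json}} with a DDL schema string behave as expected; this is not a committed approach):

{code:scala}
// Hypothetical sketch: emulating Postgres json_array_length with existing functions.
// Unlike Postgres, this fully parses the array and yields NULL for malformed JSON.
spark.sql(
  "SELECT cardinality(from_json('[1, 2, 3]', 'array<int>')) AS json_array_length"
).show()
// expected: 3
{code}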

I'm filing this ticket so that we can triage the list of Postgres JSON features 
and decide which ones make sense to support in Spark. After we've done that, we 
can create individual tickets for specific functions and features.





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28241) Show metadata operations on ThriftServerTab

2019-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28241:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28426

> Show metadata operations on ThriftServerTab
> ---
>
> Key: SPARK-28241
> URL: https://issues.apache.org/jira/browse/SPARK-28241
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> !https://user-images.githubusercontent.com/5399861/60579741-4cd2c180-9db6-11e9-822a-0433be509b67.png!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (i.e. Squirrel SQL, DBVisualizer, etc.)

2019-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24570:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28426

> SparkSQL - show schemas/tables in dropdowns of SQL client tools (i.e. Squirrel 
> SQL, DBVisualizer, etc.)
> ---
>
> Key: SPARK-24570
> URL: https://issues.apache.org/jira/browse/SPARK-24570
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: t oo
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: connect-to-sql-db-ssms-locate-table.png, hive.png, 
> spark.png
>
>
> An end-user SQL client tool (i.e. the one in the screenshot) can list tables from 
> hiveserver2 and major DBs (MySQL, Postgres, Oracle, MSSQL, etc.), but with 
> SparkSQL it does not display any tables. Having this work would be very convenient for 
> users.
> This is the exception in the client tool (Aqua Data Studio):
> {code:java}
> Title: An Error Occurred
> Summary: Unable to Enumerate Result
>  Start Message 
> 
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '`*`' expecting STRING(line 1, pos 38)
> == SQL ==
> SHOW TABLE EXTENDED FROM sit1_pb LIKE `*`
> --^^^
>  End Message 
> 
>  Start Stack Trace 
> 
> java.sql.SQLException: org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input '`*`' expecting STRING(line 1, pos 38)
> == SQL ==
> SHOW TABLE EXTENDED FROM sit1_pb LIKE `*`
> --^^^
>   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
>   at com.aquafold.aquacore.open.rdbms.drivers.hive.Qꐨꈬꈦꁐ.execute(Unknown 
> Source)
>   at \\.\\.\\हिñçêČάй語简�?한\\.gᚵ᠃᠍ꃰint.execute(Unknown Source)
>   at com.common.ui.tree.hꐊᠱꇗꇐ9int.yW(Unknown Source)
>   at com.common.ui.tree.hꐊᠱꇗꇐ9int$1.process(Unknown Source)
>   at com.common.ui.util.BackgroundThread.run(Unknown Source)
>  End Stack Trace 
> 
> {code}
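
A note on the parse error quoted above: Spark's parser expects the pattern after LIKE to be a string literal rather than a backticked identifier, so a form along these lines should parse (a sketch only, reusing the database name from the report):

{code:scala}
// Sketch: SHOW TABLE EXTENDED takes the pattern as a quoted string, which is why
// the backticked `*` emitted by the client tool fails with "expecting STRING".
spark.sql("SHOW TABLE EXTENDED FROM sit1_pb LIKE '*'").show(false)
{code}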



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections does't show db artefacts

2019-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24196:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-28426

> Spark Thrift Server - SQL Client connections does't show db artefacts
> -
>
> Key: SPARK-24196
> URL: https://issues.apache.org/jira/browse/SPARK-24196
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: rr
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: screenshot-1.png
>
>
> When connecting to Spark Thrift Server via JDBC artefacts(db objects are not 
> showing up)
> whereas when connecting to hiveserver2 it shows the schema, tables, columns 
> ...
> SQL Client user: IBM Data Studio, DBeaver SQL Client



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28104) Implement Spark's own GetColumnsOperation

2019-07-17 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28104:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-28426

> Implement Spark's own GetColumnsOperation
> -
>
> Key: SPARK-28104
> URL: https://issues.apache.org/jira/browse/SPARK-28104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-24196 and SPARK-24570 implemented Spark's own {{GetSchemasOperation}} 
> and {{GetTablesOperation}}. We also need to implement Spark's own 
> {{GetColumnsOperation}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28426) Metadata Handling in Thrift Server

2019-07-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-28426:
---

 Summary: Metadata Handling in Thrift Server
 Key: SPARK-28426
 URL: https://issues.apache.org/jira/browse/SPARK-28426
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.0.0
Reporter: Xiao Li


Currently, only `executeStatement` is handled for SQL commands. Others, such as 
`getTables`, `getSchemas`, `getColumns` and so on, fall back to an empty in-memory 
Derby catalog. As a result, some BI tools cannot show the correct object 
information.

This umbrella JIRA tracks the related improvements to the Thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28425) Add more Date/Time Operators

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28425:

Description: 
||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|


https://www.postgresql.org/docs/11/functions-datetime.html

  was:
||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|


> Add more Date/Time Operators
> 
>
> Key: SPARK-28425
> URL: https://issues.apache.org/jira/browse/SPARK-28425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Operator||Example||Result||
> |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
> 01:00:00'}}|
> |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
> 23:00:00'}}|
> |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
> 12:00'}}|{{interval '1 day 15:00:00'}}|
> |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
> |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
> |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
> |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|
> https://www.postgresql.org/docs/11/functions-datetime.html



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28425) Add more Date/Time Operators

2019-07-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28425:
---

 Summary: Add more Date/Time Operators
 Key: SPARK-28425
 URL: https://issues.apache.org/jira/browse/SPARK-28425
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|
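
For context, a minimal sketch of the closest thing that is assumed to work already (timestamp plus interval); the date and numeric-times-interval variants in the table above are the ones this ticket asks for:

{code:scala}
// Sketch (assumption, not a tested claim): timestamp + interval already evaluates,
// whereas date '2001-09-28' + interval '1 hour' and 900 * interval '1 second' do not yet.
spark.sql("SELECT timestamp '2001-09-28 00:00:00' + interval 1 hour AS ts").show(false)
{code}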



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28424) Improve interval input

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28424:

Description: 
Example:
{code:sql}
INTERVAL '1 day 2:03:04'
{code}

https://www.postgresql.org/docs/11/datatype-datetime.html

  was:
Example:
{code:sql}
interval '1 hour'
{code}


>  Improve interval input
> ---
>
> Key: SPARK-28424
> URL: https://issues.apache.org/jira/browse/SPARK-28424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Example:
> {code:sql}
> INTERVAL '1 day 2:03:04'
> {code}
> https://www.postgresql.org/docs/11/datatype-datetime.html
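
For comparison, a minimal sketch of the unit-by-unit literal that is assumed to parse on current master; the PostgreSQL day-time string form {{INTERVAL '1 day 2:03:04'}} above is the input this ticket proposes to accept:

{code:scala}
// Sketch: spelled-out units are assumed to work today; the compact '1 day 2:03:04'
// string form is the improvement being requested.
spark.sql("SELECT interval 1 day 2 hours 3 minutes 4 seconds AS i").show(false)
{code}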



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28424) Improve interval input

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28424:

Summary:  Improve interval input  (was: interval accept string input)

>  Improve interval input
> ---
>
> Key: SPARK-28424
> URL: https://issues.apache.org/jira/browse/SPARK-28424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Example:
> {code:sql}
> interval '1 hour'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28424) interval accept string input

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28424:

Description: 
Example:
{code:sql}
interval '1 hour'
{code}

> interval accept string input
> 
>
> Key: SPARK-28424
> URL: https://issues.apache.org/jira/browse/SPARK-28424
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Example:
> {code:sql}
> interval '1 hour'
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28424) interval accept string input

2019-07-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28424:
---

 Summary: interval accept string input
 Key: SPARK-28424
 URL: https://issues.apache.org/jira/browse/SPARK-28424
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28423) merge Scan and Batch/Stream

2019-07-17 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-28423:
---

 Summary: merge Scan and Batch/Stream
 Key: SPARK-28423
 URL: https://issues.apache.org/jira/browse/SPARK-28423
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28359) Make integrated UDF tests robust by making them no-op

2019-07-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28359.
-
   Resolution: Fixed
 Assignee: Hyukjin Kwon
Fix Version/s: 3.0.0

> Make integrated UDF tests robust by making them no-op
> -
>
> Key: SPARK-28359
> URL: https://issues.apache.org/jira/browse/SPARK-28359
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> The UDFs currently available in `IntegratedUDFTestUtils` are not exactly no-op: they 
> convert the input column to strings and output strings.
> This causes many issues, for instance, 
> https://github.com/apache/spark/pull/25128 or 
> https://github.com/apache/spark/pull/25110
> Ideally we should make these UDFs virtually no-op.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-07-17 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28422:
---
Description: 
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
 

  was:
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> 

[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-07-17 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28422:
---
Summary: GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by 
clause  (was: GROUPED_AGG pandas_udf doesn't work with spark.sql without group by 
clause)

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table"){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
> Maybe related to subquery?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-07-17 Thread Li Jin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated SPARK-28422:
---
Description: 
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?

  was:
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table"){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> df.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == 

[jira] [Created] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql without group by clause

2019-07-17 Thread Li Jin (JIRA)
Li Jin created SPARK-28422:
--

 Summary: GROUPED_AGG pandas_udf doesn't work with spark.sql without 
group by clause
 Key: SPARK-28422
 URL: https://issues.apache.org/jira/browse/SPARK-28422
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.4.3
Reporter: Li Jin


 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table"){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
Maybe related to subquery?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24283) Make standard scaler work without legacy MLlib

2019-07-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24283.
---
Resolution: Duplicate

> Make standard scaler work without legacy MLlib
> --
>
> Key: SPARK-24283
> URL: https://issues.apache.org/jira/browse/SPARK-24283
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: holdenk
>Priority: Trivial
>  Labels: starter
>
> Currently StandardScaler converts Spark ML vectors to MLlib vectors during 
> prediction, we should skip that step.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28246) State of UDAF: buffer is not cleared

2019-07-17 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887005#comment-16887005
 ] 

Hyukjin Kwon commented on SPARK-28246:
--

It's an implementation detail.

It's documented as follows:

{code}
   * The contract should be that applying the merge function on two initial 
buffers should just
   * return the initial buffer itself, i.e.
   * `merge(initialBuffer, initialBuffer)` should equal `initialBuffer`.
{code}

If we don't do anything within the initialization, it won't meet this contract.

> State of UDAF: buffer is not cleared
> 
>
> Key: SPARK-28246
> URL: https://issues.apache.org/jira/browse/SPARK-28246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: Ubuntu Linux 16.04
> Reproducible with option --master local[1]
> {code:java}
> $ spark-shell --master local[1]
> {code}
>Reporter: Pavel Parkhomenko
>Priority: Major
>
> Buffer object for UserDefinedAggregateFunction contains data from previous 
> iteration. For example,
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions.callUDF
> import java.util.Arrays.asList
> val df = spark.createDataFrame(
>   asList(
>     Row(1, "a"),
>     Row(2, "b")),
>   StructType(List(
>     StructField("id", IntegerType),
>     StructField("value", StringType
> trait Min extends UserDefinedAggregateFunction {
>   override val inputSchema: StructType = 
> StructType(Array(StructField("value", StringType)))
>   override val bufferSchema: StructType = StructType(Array(StructField("min", 
> StringType)))
>   override def dataType: DataType = StringType
>   override def deterministic: Boolean = true
>   override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
>     if (input(0) != null && (buffer(0) == null || buffer.getString(0) > 
> input.getString(0))) buffer(0) = input(0)
>   override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = 
> update(buffer1, buffer2)
>   override def evaluate(buffer: Row): Any = buffer(0)
> }
> class GoodMin extends Min {
>   override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) 
> = None
> }
> class BadMin extends Min {
>   override def initialize(buffer: MutableAggregationBuffer): Unit = {}
> }
> spark.udf.register("goodmin", new GoodMin)
> spark.udf.register("badmin", new BadMin)
> df groupBy "id" agg callUDF("goodmin", $"value") show false
> df groupBy "id" agg callUDF("badmin", $"value") show false
> {code}
> Output is
> {noformat}
> scala> df groupBy "id" agg callUDF("goodmin", $"value") show false
> +---+--+
> |id |goodmin(value)|
> +---+--+
> |1  |a |
> |2  |b |
> +---+--+
> scala> df groupBy "id" agg callUDF("badmin", $"value") show false
> +---+-+
> |id |badmin(value)|
> +---+-+
> |1  |a    |
> |2  |a    |
> +---+-+
> {noformat}
> The difference between GoodMin and BadMin is a buffer initialization.
> *This example can be reproduced only with a single worker thread*. To 
> reproduce it, the Spark shell must be run with the option
> {code:java}
> spark-shell --master local[1]
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27027.
--
Resolution: Duplicate

Seems a duplicate of SPARK-27798

> from_avro function does not deserialize the Avro record of a struct column 
> type correctly
> -
>
> Key: SPARK-27027
> URL: https://issues.apache.org/jira/browse/SPARK-27027
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Hien Luu
>Priority: Minor
>
> The {{from_avro}} function produces wrong output for a struct column. See the 
> output at the bottom of the description.
> {code}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.avro._
> import org.apache.spark.sql.functions._
> spark.version
> val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", 
> 50)).toDF("id", "name", "age")
> val dfStruct = df.withColumn("value", struct("name","age"))
> dfStruct.show
> dfStruct.printSchema
> val dfKV = dfStruct.select(to_avro('id).as("key"), 
> to_avro('value).as("value"))
> val expectedSchema = StructType(Seq(StructField("name", StringType, 
> true),StructField("age", IntegerType, false)))
> val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString
> val avroTypeStr = s"""
>  |{
>  | "type": "int",
>  | "name": "key"
>  |}
>  """.stripMargin
> dfKV.select(from_avro('key, avroTypeStr)).show
> dfKV.select(from_avro('value, avroTypeStruct)).show
> // output for the last statement and that is not correct
> +-+
> |from_avro(value, struct)|
> +-+
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> | [Josh Duke, 50]|
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27820) case insensitive resolver should be used in GetMapValue

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27820.
--
Resolution: Won't Fix

> case insensitive resolver should be used in GetMapValue
> ---
>
> Key: SPARK-27820
> URL: https://issues.apache.org/jira/browse/SPARK-27820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Michel Lemay
>Priority: Minor
>
> When extracting a key value from a MapType, Spark calls GetMapValue 
> (complexTypeExtractors.scala) and only uses the map type's ordering. It should 
> use the resolver instead.
> Starting spark with: `{{spark-shell --conf spark.sql.caseSensitive=false`}}
> Given dataframe:
>  {{val df = List(Map("a" -> 1), Map("A" -> 2)).toDF("m")}}
> Executing any of these returns only one row: the column name is matched 
> case-insensitively, but the map keys are matched case-sensitively.
> {{df.filter($"M.A".isNotNull).count}}
>  {{df.filter($"M"("A").isNotNull).count 
> df.filter($"M".getField("A").isNotNull).count}}
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28421) SparseVector.apply performance optimization

2019-07-17 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-28421:


 Summary: SparseVector.apply performance optimization
 Key: SPARK-28421
 URL: https://issues.apache.org/jira/browse/SPARK-28421
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


The current implementation of SparseVector.apply is inefficient:

on each call, a breeze.linalg.SparseVector and a 
breeze.collection.mutable.SparseArray are created internally, and then a 
binary search is used to look up the requested position.

This should be optimized like .ml.SparseMatrix, which uses binary 
search directly, without converting to breeze.linalg.Matrix.

I tested the performance and found that if we avoid the internal conversions, 
a 2.5~5X speed-up can be obtained.
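
A minimal sketch of the direct lookup being described (the helper name is hypothetical; the real change would live inside {{SparseVector.apply}}):

{code:scala}
import java.util.Arrays

// Hypothetical standalone version of the proposed lookup: binary-search the sorted
// indices array and return the stored value, or 0.0 for an implicit zero,
// without building any intermediate breeze structures.
def sparseApply(indices: Array[Int], values: Array[Double], i: Int): Double = {
  val pos = Arrays.binarySearch(indices, i)
  if (pos >= 0) values(pos) else 0.0
}

// Usage: a sparse vector with non-zeros at positions 1 and 3.
val indices = Array(1, 3)
val values = Array(7.0, 9.0)
assert(sparseApply(indices, values, 3) == 9.0)
assert(sparseApply(indices, values, 2) == 0.0)
{code}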



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886913#comment-16886913
 ] 

Gabor Somogyi commented on SPARK-28415:
---

Kafka 0.8 support is deprecated as of Spark 2.3.0.

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  
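
Regarding point 2 above, the pattern documented for the 0.10 API is to capture the offset ranges from the raw RDD before any transformation; a rough sketch (assuming {{stream}} is the {{InputDStream[ConsumerRecord[String, String]]}} returned by {{KafkaUtils.createDirectStream}}):

{code:scala}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Sketch only: take the HasOffsetRanges cast first, then project the record values,
// so no map/filter hides the underlying KafkaRDD.
stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val projected = rdd.map(_.value)   // early projection of the large JSON payload
  // ... process `projected`, then commit or log `offsetRanges` as needed
}
{code}

This only covers the HasOffsetRanges concern; point 1 (projecting as early as possible in the fetch path) is a separate question.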



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Michael Spector (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886906#comment-16886906
 ] 

Michael Spector commented on SPARK-28415:
-

[~gsomogyi] Can you say whether the Kafka 0.8 API will be supported forever, or whether it 
will be deprecated at some point? If the latter is true, then this basic 
functionality must be preserved even though the API is different.

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886905#comment-16886905
 ] 

Gabor Somogyi commented on SPARK-28415:
---

I don't see a reason why a different API should behave the same way.

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28420) Date/Time Functions: date_part

2019-07-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28420:
---

 Summary: Date/Time Functions: date_part
 Key: SPARK-28420
 URL: https://issues.apache.org/jira/browse/SPARK-28420
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{date_part(}}{{text}}{{, }}{{timestamp}}{{)}}|{{double precision}}|Get 
subfield (equivalent to {{extract}}); see [Section 
9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('hour',
 timestamp '2001-02-16 20:38:40')}}|{{20}}|
|{{date_part(}}{{text}}{{, }}{{interval}}{{)}}|{{double precision}}|Get 
subfield (equivalent to {{extract}}); see [Section 
9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('month',
 interval '2 years 3 months')}}|{{3}}|

We can replace it with {{extract(field from timestamp)}}.

https://www.postgresql.org/docs/11/functions-datetime.html
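
A minimal sketch of that replacement for the first row of the table, assuming an EXTRACT-capable build (e.g. the 3.0 development line):

{code:scala}
// Sketch: date_part('hour', timestamp '2001-02-16 20:38:40') rewritten with EXTRACT.
spark.sql("SELECT extract(hour FROM timestamp '2001-02-16 20:38:40') AS hour_part").show()
// expected: 20
{code}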





--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Michael Spector (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886900#comment-16886900
 ] 

Michael Spector commented on SPARK-28415:
-

Well, it depends on your perspective on the issue. If it's a regression 
(and it is a regression from our perspective, since we're unable to upgrade 
to the new API), then it's a bug.



> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28415:
--
Issue Type: New Feature  (was: Bug)

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886898#comment-16886898
 ] 

Gabor Somogyi commented on SPARK-28415:
---

This is more like a new feature than a bug, so I've changed the issue type accordingly.

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2019-07-17 Thread Gabor Somogyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-28415:
--
Component/s: (was: Structured Streaming)
 DStreams

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.3
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in the new 
> Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with the 
> traffic.
>  # Transforming Kafka events right after a stream is created prevents using the 
> HasOffsetRanges interface later. This means the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28419) A patch for SparkThriftServer support multi-tenant authentication

2019-07-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28419:

Affects Version/s: (was: 2.4.0)
   3.0.0

> A patch for SparkThriftServer support multi-tenant authentication
> -
>
> Key: SPARK-28419
> URL: https://issues.apache.org/jira/browse/SPARK-28419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28419) A patch for SparkThriftServer support multi-tenant authentication

2019-07-17 Thread angerszhu (JIRA)
angerszhu created SPARK-28419:
-

 Summary: A patch for SparkThriftServer support multi-tenant 
authentication
 Key: SPARK-28419
 URL: https://issues.apache.org/jira/browse/SPARK-28419
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: angerszhu






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28363) Enable run test with clover

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28363.
--
Resolution: Incomplete

No feedback.

> Enable run test with clover
> ---
>
> Key: SPARK-28363
> URL: https://issues.apache.org/jira/browse/SPARK-28363
> Project: Spark
>  Issue Type: Task
>  Components: Build, Tests
>Affects Versions: 2.3.0
>Reporter: luhuachao
>Priority: Major
>
> Currently, a compilation error occurs when running tests with Clover because of 
> the Java-Scala cross compilation in Spark; see 
> [https://confluence.atlassian.com/cloverkb/java-scala-cross-compilation-error-cannot-find-symbol-765593874.html]
> Do we need to modify the pom.xml to support Clover in Spark?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files which is getting stored in sub-directories.

2019-07-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28364.
--
Resolution: Incomplete

No feedback.

> Unable to read complete data from an external hive table stored as ORC that 
> points to a managed table's data files which is getting stored in 
> sub-directories.
> --
>
> Key: SPARK-28364
> URL: https://issues.apache.org/jira/browse/SPARK-28364
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Debdut Mukherjee
>Priority: Major
> Attachments: pic.PNG
>
>
> Unable to read complete data from an external Hive table stored as ORC that 
> points to a managed table's data files (ORC) stored in sub-directories.
> The count also does not match unless the path is given with a trailing *.
> *Example - this works:*
> "adl://<store name>.azuredatalakestore.net/clusters/<cluster path>/hive/warehouse/db2.db/tbl1/*"
> But the above creates a blank directory named ' * ' in the ADLS (Azure Data 
> Lake Store) location.
>  
> The table below does not work when a SELECT COUNT(*) is executed against it; 
> it returns only a partial count.
> {code:sql}
> CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 (
>   Col_1 string,
>   Col_2 string
> )
> STORED AS ORC
> LOCATION "adl://<store name>.azuredatalakestore.net/clusters/<cluster path>/hive/warehouse/db2.db/tbl1/"
> {code}
>  
> I searched for a resolution on Google, and even adding the lines below in the 
> Databricks notebook did not solve the problem (a runnable variant is sketched 
> after this description).
> {code}
> sqlContext.setConf("mapred.input.dir.recursive", "true");
> sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true");
> {code}
>  
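For reference, a hedged Scala sketch of re-running the count with the recursive-listing settings applied programmatically. The table name comes from the description above; the SparkSession setup and the convertMetastoreOrc fallback are assumptions for reproduction, not a confirmed fix.

{code:scala}
import org.apache.spark.sql.SparkSession

object RecursiveOrcCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("recursive-orc-count-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Recursive-listing flags mentioned in the report, set on the Hadoop
    // configuration used for table scans.
    spark.sparkContext.hadoopConfiguration
      .set("mapred.input.dir.recursive", "true")
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

    // Falling back to the Hive ORC reader instead of Spark's native one is a
    // commonly suggested workaround for nested data files; untested for this case.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")

    // Table name taken from the report; compare this count against the managed table.
    spark.sql("SELECT COUNT(*) FROM db1.tbl1").show()

    spark.stop()
  }
}
{code}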



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0

2019-07-17 Thread xifeng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886796#comment-16886796
 ] 

xifeng commented on SPARK-27884:


looks fine

> Deprecate Python 2 support in Spark 3.0
> ---
>
> Key: SPARK-27884
> URL: https://issues.apache.org/jira/browse/SPARK-27884
> Project: Spark
>  Issue Type: Story
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: release-notes
>
> Officially deprecate Python 2 support in Spark 3.0.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing support

2019-07-17 Thread Aditya Prakash (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886768#comment-16886768
 ] 

Aditya Prakash commented on SPARK-19256:


[~chengsu] any update on this?

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28418) Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect

2019-07-17 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-28418:


 Summary: Flaky Test: pyspark.sql.tests.test_dataframe: 
test_query_execution_listener_on_collect
 Key: SPARK-28418
 URL: https://issues.apache.org/jira/browse/SPARK-28418
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


{code}
ERROR [0.164s]: test_query_execution_listener_on_collect 
(pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests)
--
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, in 
test_query_execution_listener_on_collect
"The callback from the query execution listener should be called after 
'collect'")
AssertionError: The callback from the query execution listener should be called 
after 'collect'
{code}

It seems the test can fail because the assertion does not wait for the listener events to be processed (a generic illustration follows below).
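The PySpark fix itself is not spelled out here, but the general remedy is sketched below as a small, illustrative Scala helper (all names are made up): instead of asserting immediately after collect(), poll for the asynchronous callback until it fires or a timeout elapses.

{code:scala}
object AwaitCallback {
  /** Polls `fired` until it returns true or `timeoutMs` elapses; returns the final state. */
  def awaitCallback(fired: () => Boolean,
                    timeoutMs: Long = 10000L,
                    intervalMs: Long = 100L): Boolean = {
    val deadline = System.currentTimeMillis() + timeoutMs
    var ok = fired()
    while (!ok && System.currentTimeMillis() < deadline) {
      Thread.sleep(intervalMs)
      ok = fired()
    }
    ok
  }

  def main(args: Array[String]): Unit = {
    // Toy usage: a flag flipped by another thread stands in for the listener callback.
    @volatile var callbackCalled = false
    new Thread(() => { Thread.sleep(300); callbackCalled = true }).start()
    assert(awaitCallback(() => callbackCalled),
      "The callback should be observed once the helper has waited for it")
  }
}
{code}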



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org