[jira] [Updated] (SPARK-28429) SQL Datetime util function being casted to double instead of timestamp
[ https://issues.apache.org/jira/browse/SPARK-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28429: Component/s: (was: Tests) > SQL Datetime util function being casted to double instead of timestamp > -- > > Key: SPARK-28429 > URL: https://issues.apache.org/jira/browse/SPARK-28429 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In the code below, 'now()+'100 days' are casted to double and then an error > is thrown: > {code:sql} > CREATE TEMP VIEW v_window AS > SELECT i, min(i) over (order by i range between '1 day' preceding and '10 > days' following) as min_i > FROM range(now(), now()+'100 days', '1 hour') i; > {code} > Error: > {code:sql} > cannot resolve '(current_timestamp() + CAST('100 days' AS DOUBLE))' due to > data type mismatch: differing types in '(current_timestamp() + CAST('100 > days' AS DOUBLE))' (timestamp and double).;{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
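The root cause is that the string literal '100 days' is resolved to a double rather than an interval before being added to a timestamp. The intended semantics can be sketched in plain Python with `datetime`; the `parse_interval` helper below is hypothetical and handles only the simple `'<n> <unit>'` form, unlike Spark's full interval parser:

```python
from datetime import datetime, timedelta

def parse_interval(text):
    """Parse a simple '<n> <unit>' interval string into a timedelta.

    Hypothetical helper for illustration only; Spark's interval
    parser supports far more forms.
    """
    value, unit = text.split()
    unit = unit.rstrip("s")  # normalize 'days' -> 'day'
    units = {"day": "days", "hour": "hours", "minute": "minutes"}
    return timedelta(**{units[unit]: int(value)})

start = datetime(2019, 7, 18)
end = start + parse_interval("100 days")
print(end)  # 2019-10-26 00:00:00
```

With interval-aware resolution, `now() + '100 days'` would behave like the timedelta addition above instead of failing with a timestamp/double type mismatch.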
[jira] [Assigned] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala
[ https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28411: Assignee: Huaxin Gao > insertInto with overwrite inconsistent behaviour Python/Scala > - > > Key: SPARK-28411 > URL: https://issues.apache.org/jira/browse/SPARK-28411 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.1, 2.4.0 >Reporter: Maria Rebelka >Assignee: Huaxin Gao >Priority: Minor > > The df.write.mode("overwrite").insertInto("table") has inconsistent behaviour > between Scala and Python. In Python, insertInto ignores "mode" parameter and > appends by default. Only when changing syntax to df.write.insertInto("table", > overwrite=True) we get expected behaviour. > This is a native Spark syntax, expected to be the same between languages... > Also, in other write methods, like saveAsTable or write.parquet "mode" seem > to be respected. > Reproduce, Python, ignore "overwrite": > {code:java} > df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j']) > # create the table and load data > df.write.saveAsTable("spark_overwrite_issue") > # insert overwrite, expected result - 2 rows > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 4 rows, insert appended data instead of overwrite{code} > Reproduce, Scala, works as expected: > {code:java} > val df = Seq((1, 2),(3,4)).toDF("i","j") > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 2 rows{code} > Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala
[ https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28411. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25175 [https://github.com/apache/spark/pull/25175] > insertInto with overwrite inconsistent behaviour Python/Scala > - > > Key: SPARK-28411 > URL: https://issues.apache.org/jira/browse/SPARK-28411 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.1, 2.4.0 >Reporter: Maria Rebelka >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > The df.write.mode("overwrite").insertInto("table") has inconsistent behaviour > between Scala and Python. In Python, insertInto ignores "mode" parameter and > appends by default. Only when changing syntax to df.write.insertInto("table", > overwrite=True) we get expected behaviour. > This is a native Spark syntax, expected to be the same between languages... > Also, in other write methods, like saveAsTable or write.parquet "mode" seem > to be respected. > Reproduce, Python, ignore "overwrite": > {code:java} > df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j']) > # create the table and load data > df.write.saveAsTable("spark_overwrite_issue") > # insert overwrite, expected result - 2 rows > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 4 rows, insert appended data instead of overwrite{code} > Reproduce, Scala, works as expected: > {code:java} > val df = Seq((1, 2),(3,4)).toDF("i","j") > df.write.mode("overwrite").insertInto("spark_overwrite_issue") > spark.sql("select * from spark_overwrite_issue").count() > # result - 2 rows{code} > Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
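The expected overwrite-vs-append semantics can be modeled with a toy in-memory table. This is illustrative only, not Spark's implementation; `insert_into` is a hypothetical stand-in for `DataFrameWriter.insertInto`:

```python
def insert_into(table, rows, overwrite=False):
    """Toy model of insertInto semantics: overwrite replaces the
    table contents, append adds to them."""
    if overwrite:
        table.clear()
    table.extend(rows)
    return table

tbl = [(1, 2), (3, 4)]                      # initial saveAsTable load
insert_into(tbl, [(1, 2), (3, 4)])          # mode ignored: the buggy append
assert len(tbl) == 4                        # Python's observed result
insert_into(tbl, [(1, 2), (3, 4)], overwrite=True)
assert len(tbl) == 2                        # the expected (Scala) result
```

The bug was precisely that PySpark behaved like the first call even when `mode("overwrite")` was set, while Scala behaved like the second.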
[jira] [Resolved] (SPARK-27609) from_json expects values of options dictionary to be
[ https://issues.apache.org/jira/browse/SPARK-27609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27609. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25182 [https://github.com/apache/spark/pull/25182] > from_json expects values of options dictionary to be > - > > Key: SPARK-27609 > URL: https://issues.apache.org/jira/browse/SPARK-27609 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 > Environment: I've found this issue on an AWS Glue development > endpoint which is running Spark 2.2.1 and being given jobs through a > SparkMagic Python 2 kernel, running through livy and all that. I don't know > how much of that is important for reproduction, and can get more details if > needed. >Reporter: Zachary Jablons >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > When reading a column of a DataFrame that consists of serialized JSON, one of > the options for inferring the schema and then parsing the JSON is to do a two > step process consisting of: > > {code} > # this results in a new dataframe where the top-level keys of the JSON # are > columns > df_parsed_direct = spark.read.json(df.rdd.map(lambda row: row.json_col)) > # this does that while preserving the rest of df > schema = df_parsed_direct.schema > df_parsed = df.withColumn('parsed', from_json(df.json_col, schema) > {code} > When I do this, I sometimes find myself passing in options. My understanding > is, from the documentation > [here|http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json], > that the nature of these options passed should be the same whether I do > {code} > spark.read.option('option',value) > {code} > or > {code} > from_json(df.json_col, schema, options={'option':value}) > {code} > > However, I've found that the latter expects value to be a string > representation of the value that can be decoded by JSON. 
So, for example > options=\{'multiLine':True} fails with > {code} > java.lang.ClassCastException: java.lang.Boolean cannot be cast to > java.lang.String > {code} > whereas {{options=\{'multiLine':'true'}}} works just fine. > Notably, providing {{spark.read.option('multiLine',True)}} works fine! > The code for reproducing this issue as well as the stacktrace from hitting it > are provided in [this > gist|https://gist.github.com/zmjjmz/0af5cf9b059b4969951e825565e266aa]. > I also noticed that from_json doesn't complain if you give it a garbage > option key – but that seems separate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
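A common workaround until the fix is to stringify the option values before passing them to `from_json`. The helper below is a hypothetical sketch, not part of PySpark; `json.dumps` conveniently renders Python booleans as the lowercase `'true'`/`'false'` strings the Scala side expects:

```python
import json

def stringify_options(options):
    """Convert option values to the string form from_json expects,
    e.g. True -> 'true'. Workaround sketch only, not a PySpark API."""
    return {k: v if isinstance(v, str) else json.dumps(v)
            for k, v in options.items()}

opts = stringify_options({"multiLine": True, "mode": "PERMISSIVE"})
print(opts)  # {'multiLine': 'true', 'mode': 'PERMISSIVE'}
```

After this transformation, `from_json(df.json_col, schema, options=opts)` no longer hits the Boolean-to-String ClassCastException described above.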
[jira] [Assigned] (SPARK-27609) from_json expects values of options dictionary to be
[ https://issues.apache.org/jira/browse/SPARK-27609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-27609: Assignee: Maxim Gekk > from_json expects values of options dictionary to be > - > > Key: SPARK-27609 > URL: https://issues.apache.org/jira/browse/SPARK-27609 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.1 > Environment: I've found this issue on an AWS Glue development > endpoint which is running Spark 2.2.1 and being given jobs through a > SparkMagic Python 2 kernel, running through livy and all that. I don't know > how much of that is important for reproduction, and can get more details if > needed. >Reporter: Zachary Jablons >Assignee: Maxim Gekk >Priority: Minor > > When reading a column of a DataFrame that consists of serialized JSON, one of > the options for inferring the schema and then parsing the JSON is to do a two > step process consisting of: > > {code} > # this results in a new dataframe where the top-level keys of the JSON # are > columns > df_parsed_direct = spark.read.json(df.rdd.map(lambda row: row.json_col)) > # this does that while preserving the rest of df > schema = df_parsed_direct.schema > df_parsed = df.withColumn('parsed', from_json(df.json_col, schema) > {code} > When I do this, I sometimes find myself passing in options. My understanding > is, from the documentation > [here|http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json], > that the nature of these options passed should be the same whether I do > {code} > spark.read.option('option',value) > {code} > or > {code} > from_json(df.json_col, schema, options={'option':value}) > {code} > > However, I've found that the latter expects value to be a string > representation of the value that can be decoded by JSON. 
So, for example > options=\{'multiLine':True} fails with > {code} > java.lang.ClassCastException: java.lang.Boolean cannot be cast to > java.lang.String > {code} > whereas {{options=\{'multiLine':'true'}}} works just fine. > Notably, providing {{spark.read.option('multiLine',True)}} works fine! > The code for reproducing this issue as well as the stacktrace from hitting it > are provided in [this > gist|https://gist.github.com/zmjjmz/0af5cf9b059b4969951e825565e266aa]. > I also noticed that from_json doesn't complain if you give it a garbage > option key – but that seems separate. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28434) Decision Tree model isn't equal after save and load
[ https://issues.apache.org/jira/browse/SPARK-28434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ievgen Prokhorenko updated SPARK-28434: --- Description: The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` on the line no. 628 has a TODO saying: {code:java} // TODO: Check other fields besides the information gain. {code} If, in addition to the existing check of InformationGainStats' gain value I add another check, for instance, impurity – the test fails because the values are different in the saved model and the one restored from disk. See PR with an example. The tests are executed with this command: {code:java} build/mvn -e -Dtest=none -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code} Excerpts from the output of the command above: {code:java} ... - model save/load *** FAILED *** checkEqual failed since the two trees were not identical. TREE A: DecisionTreeModel classifier of depth 2 with 5 nodes If (feature 0 <= 0.5) Predict: 0.0 Else (feature 0 > 0.5) If (feature 1 in {0.0,1.0}) Predict: 0.0 Else (feature 1 not in {0.0,1.0}) Predict: 0.0 TREE B: DecisionTreeModel classifier of depth 2 with 5 nodes If (feature 0 <= 0.5) Predict: 0.0 Else (feature 0 > 0.5) If (feature 1 in {0.0,1.0}) Predict: 0.0 Else (feature 1 not in {0.0,1.0}) Predict: 0.0 (DecisionTreeSuite.scala:610) ...{code} If I add a little debug info in the `DecisionTreeSuite.checkEqual`: {code:java} val aStats = a.stats val bStats = b.stats println(s"id ${a.id} ${b.id}") println(s"impurity ${aStats.get.impurity} ${bStats.get.impurity}") println(s"leftImpurity ${aStats.get.leftImpurity} ${bStats.get.leftImpurity}") println(s"rightImpurity ${aStats.get.rightImpurity} ${bStats.get.rightImpurity}") println(s"leftPredict ${aStats.get.leftPredict} ${bStats.get.leftPredict}") println(s"rightPredict ${aStats.get.rightPredict} ${bStats.get.rightPredict}") println(s"gain ${aStats.get.gain} ${bStats.get.gain}") {code} Then, in the output 
of the test command we can see that only values of `gain` are equal: {code:java} id 1 1 impurity 0.2 0.5 leftImpurity 0.3 0.5 rightImpurity 0.4 0.5 leftPredict 1.0 (prob = 0.4) 0.0 (prob = 1.0) rightPredict 0.0 (prob = 0.6) 0.0 (prob = 1.0) gain 0.1 0.1 {code} was: The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` on the line no. 628 has a TODO saying: {code:java} // TODO: Check other fields besides the information gain. {code} If, in addition to the existing check of InformationGainStats' gain value I add another check, for instance, impurity – the test fails because the values are different in the saved model and the one restored from disk. See PR with an example. The tests are executed with this command: {code:java} build/mvn -e -Dtest=none -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code} > Decision Tree model isn't equal after save and load > --- > > Key: SPARK-28434 > URL: https://issues.apache.org/jira/browse/SPARK-28434 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.4.3 > Environment: spark from master >Reporter: Ievgen Prokhorenko >Priority: Major > > The file > `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` on > the line no. 628 has a TODO saying: > > {code:java} > // TODO: Check other fields besides the information gain. > {code} > If, in addition to the existing check of InformationGainStats' gain value I > add another check, for instance, impurity – the test fails because the values > are different in the saved model and the one restored from disk. > > See PR with an example. > > The tests are executed with this command: > > {code:java} > build/mvn -e -Dtest=none > -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code} > > Excerpts from the output of the command above: > {code:java} > ... > - model save/load *** FAILED *** > checkEqual failed since the two trees were not identical. 
> TREE A: > DecisionTreeModel classifier of depth 2 with 5 nodes > If (feature 0 <= 0.5) > Predict: 0.0 > Else (feature 0 > 0.5) > If (feature 1 in {0.0,1.0}) > Predict: 0.0 > Else (feature 1 not in {0.0,1.0}) > Predict: 0.0 > TREE B: > DecisionTreeModel classifier of depth 2 with 5 nodes > If (feature 0 <= 0.5) > Predict: 0.0 > Else (feature 0 > 0.5) > If (feature 1 in {0.0,1.0}) > Predict: 0.0 > Else (feature 1 not in {0.0,1.0}) > Predict: 0.0 (DecisionTreeSuite.scala:610) > ...{code} > If I add a little debug info in the `DecisionTreeSuite.checkEqual`: > > {code:java} > val aStats = a.stats > val bStats = b.stats > println(s"id ${a.id} ${b.id}") > println(s"impurity ${aStats.get.impurity} ${bStats.get.impurity}") > println(s"leftImpurity
[jira] [Created] (SPARK-28434) Decision Tree model isn't equal after save and load
Ievgen Prokhorenko created SPARK-28434: -- Summary: Decision Tree model isn't equal after save and load Key: SPARK-28434 URL: https://issues.apache.org/jira/browse/SPARK-28434 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.4.3 Environment: spark from master Reporter: Ievgen Prokhorenko The file `mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala` on the line no. 628 has a TODO saying: {code:java} // TODO: Check other fields besides the information gain. {code} If, in addition to the existing check of InformationGainStats' gain value I add another check, for instance, impurity – the test fails because the values are different in the saved model and the one restored from disk. See PR with an example. The tests are executed with this command: {code:java} build/mvn -e -Dtest=none -DwildcardSuites=org.apache.spark.mllib.tree.DecisionTreeSuite test{code}
[jira] [Created] (SPARK-28433) Incorrect assertion in scala test for aarch64 platform
huangtianhua created SPARK-28433: Summary: Incorrect assertion in scala test for aarch64 platform Key: SPARK-28433 URL: https://issues.apache.org/jira/browse/SPARK-28433 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3, 3.0.0 Reporter: huangtianhua We ran the Spark unit tests on an aarch64 server; two SQL Scala tests failed: - SPARK-26021: NaN and -0.0 in grouping expressions *** FAILED *** 2143289344 equaled 2143289344 (DataFrameAggregateSuite.scala:732) - NaN and -0.0 in window partition keys *** FAILED *** 2143289344 equaled 2143289344 (DataFrameWindowFunctionsSuite.scala:704) We found that the values of floatToRawIntBits(0.0f / 0.0f) and floatToRawIntBits(Float.NaN) are the same (2143289344) on aarch64. At first we thought it was a JDK or Scala issue, but after discussing with jdk-dev and the Scala community (see https://users.scala-lang.org/t/the-value-of-floattorawintbits-0-0f-0-0f-is-different-on-x86-64-and-aarch64-platforms/4845), we believe the value depends on the architecture.
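The Java `Float.floatToRawIntBits` call at the heart of the failing assertions can be reproduced in Python with `struct`. The sketch below assumes an x86-64 host, where `float('nan')` carries the canonical quiet-NaN pattern `0x7fc00000` (2143289344); as the issue notes, the raw bits of a computed NaN are architecture-dependent:

```python
import struct

def float_to_raw_int_bits(x):
    """Python analogue of Java's Float.floatToRawIntBits: reinterpret
    a single-precision float's bytes as an unsigned 32-bit integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

# Canonical quiet NaN on x86-64: 0x7fc00000 == 2143289344,
# the value both failing assertions printed.
print(hex(float_to_raw_int_bits(float("nan"))))
```

On aarch64 the hardware also produces this canonical pattern for `0.0f / 0.0f`, which is why the test's inequality assertion fails there.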
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887561#comment-16887561 ] Chang chen commented on SPARK-10816: Hi guys, any updates on this issue? > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > >
[jira] [Issue Comment Deleted] (SPARK-28293) Implement Spark's own GetTableTypesOperation
[ https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28293: Comment: was deleted (was: I'm working on) > Implement Spark's own GetTableTypesOperation > > > Key: SPARK-28293 > URL: https://issues.apache.org/jira/browse/SPARK-28293 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: Hive-1.2.1.png, Hive-2.3.5.png > > > Build with Hive 1.2.1: > !Hive-1.2.1.png! > Build with Hive 2.3.5: > !Hive-2.3.5.png!
[jira] [Created] (SPARK-28432) Date/Time Functions: make_date/make_timestamp
Yuming Wang created SPARK-28432: --- Summary: Date/Time Functions: make_date/make_timestamp Key: SPARK-28432 URL: https://issues.apache.org/jira/browse/SPARK-28432 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang ||Function||Return Type||Description||Example||Result|| |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ }}{{int}}{{)}}|{{date}}|Create date from year, month and day fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}| |{{make_timestamp(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ }}{{int}}{{, _hour_ }}{{int}}{{, _min_ }}{{int}}{{, _sec_}}{{double precision}}{{)}}|{{timestamp}}|Create timestamp from year, month, day, hour, minute and seconds fields|{{make_timestamp(2013, 7, 15, 8, 15, 23.5)}}|{{2013-07-15 08:15:23.5}}| https://www.postgresql.org/docs/11/functions-datetime.html
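For reference, the proposed functions map directly onto Python's `datetime` constructors. The sketch below is a hypothetical pure-Python analogue, not Spark's implementation; note that the fractional `sec` argument has to be split into whole seconds and microseconds:

```python
from datetime import date, datetime

def make_date(year, month, day):
    """Analogue of the proposed make_date(year, month, day)."""
    return date(year, month, day)

def make_timestamp(year, month, day, hour, minute, sec):
    """Analogue of the proposed make_timestamp; sec is a double,
    so split it into whole seconds and microseconds."""
    whole = int(sec)
    micros = round((sec - whole) * 1_000_000)
    return datetime(year, month, day, hour, minute, whole, micros)

print(make_date(2013, 7, 15))                     # 2013-07-15
print(make_timestamp(2013, 7, 15, 8, 15, 23.5))   # 2013-07-15 08:15:23.500000
```

These match the PostgreSQL examples in the table above: `make_date(2013, 7, 15)` yields `2013-07-15` and `make_timestamp(2013, 7, 15, 8, 15, 23.5)` yields `2013-07-15 08:15:23.5`.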
[jira] [Updated] (SPARK-28293) Implement Spark's own GetTableTypesOperation
[ https://issues.apache.org/jira/browse/SPARK-28293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28293: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28426 > Implement Spark's own GetTableTypesOperation > > > Key: SPARK-28293 > URL: https://issues.apache.org/jira/browse/SPARK-28293 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > Attachments: Hive-1.2.1.png, Hive-2.3.5.png > > > Build with Hive 1.2.1: > !Hive-1.2.1.png! > Build with Hive 2.3.5: > !Hive-2.3.5.png!
[jira] [Updated] (SPARK-28167) Show global temporary view in database tool
[ https://issues.apache.org/jira/browse/SPARK-28167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28167: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28426 > Show global temporary view in database tool > --- > > Key: SPARK-28167 > URL: https://issues.apache.org/jira/browse/SPARK-28167 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
[jira] [Updated] (SPARK-28184) Avoid creating new sessions in SparkMetadataOperationSuite
[ https://issues.apache.org/jira/browse/SPARK-28184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28184: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28426 > Avoid creating new sessions in SparkMetadataOperationSuite > -- > > Key: SPARK-28184 > URL: https://issues.apache.org/jira/browse/SPARK-28184 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-28431) CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message
Weichen Xu created SPARK-28431: -- Summary: CSV datasource throw com.univocity.parsers.common.TextParsingException with large size message Key: SPARK-28431 URL: https://issues.apache.org/jira/browse/SPARK-28431 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Weichen Xu The CSV datasource throws a com.univocity.parsers.common.TextParsingException with a very large message, which can make the log output consume a lot of disk space. Reproduce code: {code:java} val s = "a" * 40 * 100 Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv") spark.read .option("maxCharsPerColumn", 3000) .csv("/tmp/bogdan/es4196.csv").count{code} Because of the maxCharsPerColumn limit of 30M, there will be a TextParsingException. The message of this exception actually includes everything parsed so far, in this case 30M chars. This is troublesome when we need to parse a CSV with a large column. We should truncate the oversized message in the TextParsingException.
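The proposed fix, truncating the oversized message, can be sketched as follows; the function name and the 1000-character cap are illustrative, not the values any eventual patch would use:

```python
def truncate_message(msg, max_len=1000):
    """Cap an exception message at max_len characters so a huge
    parsed-content dump cannot flood the logs. Sketch only."""
    if len(msg) <= max_len:
        return msg
    return msg[:max_len] + f"... [truncated {len(msg) - max_len} chars]"

huge = "a" * 4000          # stands in for the parsed-so-far column content
short = truncate_message(huge, max_len=100)
print(len(short))          # far smaller than the original 4000
```

Wrapping the exception message this way preserves enough context to debug the parse failure while keeping log volume bounded.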
[jira] [Comment Edited] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift
[ https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887498#comment-16887498 ] Josh Rosen edited comment on SPARK-27570 at 7/18/19 12:28 AM: -- [~ste...@apache.org], I finally got a chance to test your {{fadvise}} configuration recommendation and that resolved my issue. *However*, I think that there's a typo in your recommendation: this only worked when I used {{fs.s3a.experimental.*input*.fadvise}} (the {{.input}} was missing in your comment). *Update*: filed HADOOP-16437 to fix the documentation typo. was (Author: joshrosen): [~ste...@apache.org], I finally got a chance to test your {{fadvise}} configuration recommendation and that resolved my issue. *However*, I think that there's a typo in your recommendation: this only worked when I used {{fs.s3a.experimental.*input*.fadvise}} (the {{.input}} was missing in your comment). > java.io.EOFException Reached the end of stream - Reading Parquet from Swift > --- > > Key: SPARK-27570 > URL: https://issues.apache.org/jira/browse/SPARK-27570 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Harry Hough >Priority: Major > > I did see issue SPARK-25966 but it seems there are some differences as his > problem was resolved after rebuilding the parquet files on write. This is > 100% reproducible for me across many different days of data. > I get exceptions such as "Reached the end of stream with 750477 bytes left to > read" during some read operations of parquet files. I am reading these files > from Openstack swift using openstack-hadoop 2.7.7 on Spark 2.4. > The issues seem to happen with the where statement. I have also tried filter > and combining the statements into one as well as the dataset method with > column without any luck. Which column or what the actual filter is on the > where also doesn't seem to make a difference to the error occurring or not. 
> > {code:java} > val engagementDS = spark > .read > .parquet(createSwiftAddr("engagements", folder)) > .where("engtype != 0") > .where("engtype != 1000") > .groupBy($"accid", $"sessionkey") > .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", > $"testid")).as("engagements")) > // Exiting paste mode, now interpreting. > [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception > in task 24.0 in stage 53.0 (TID 688) > java.io.EOFException: Reached the end of stream with 1323959 bytes left to > read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105) > at >
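Based on the comment above, the working configuration (with the `.input` segment included) can be passed to Spark as a Hadoop property; this is a config fragment only, shown as it would appear in a Spark conf file:

```properties
# Route the S3A fadvise policy through Spark's Hadoop configuration.
# Note the ".input" segment, which the original recommendation omitted.
spark.hadoop.fs.s3a.experimental.input.fadvise=random
```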
[jira] [Commented] (SPARK-14543) SQL/Hive insertInto has unexpected results
[ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887508#comment-16887508 ] Alexander Tronchin-James commented on SPARK-14543: -- Did the calling syntax change for this? I'm using 2.4.x and can't find anything about .byName on writers in the docs, but maybe I'm just bad at searching the docs... > SQL/Hive insertInto has unexpected results > -- > > Key: SPARK-14543 > URL: https://issues.apache.org/jira/browse/SPARK-14543 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > > *Updated description* > There should be an option to match input data to output columns by name. The > API allows operations on tables, which hide the column resolution problem. > It's easy to copy from one table to another without listing the columns, and > in the API it is common to work with columns by name rather than by position. > I think the API should add a way to match columns by name, which is closer to > what users expect. I propose adding something like this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} > *Original description* > The Hive write path adds a pre-insertion cast (projection) to reconcile > incoming data columns with the outgoing table schema. Columns are matched by > position and casts are inserted to reconcile the two column schemas. > When columns aren't correctly aligned, this causes unexpected results. I ran > into this by not using a correct {{partitionBy}} call (addressed by > SPARK-14459), which caused an error message that an int could not be cast to > an array. However, if the columns are vaguely compatible, for example string > and float, then no error or warning is produced and data is written to the > wrong columns using unexpected casts (string -> bigint -> float). 
> A real-world use case that will hit this is when a table definition changes > by adding a column in the middle of a table. Spark SQL statements that copied > from that table to a destination table will then map the columns differently > but insert casts that mask the problem. The last column's data will be > dropped without a reliable warning for the user. > This highlights a few problems: > * Too many or too few incoming data columns should cause an AnalysisException > to be thrown > * Only "safe" casts should be inserted automatically, like int -> long, using > UpCast > * Pre-insertion casts currently ignore extra columns by using zip > * The pre-insertion cast logic differs between Hive's MetastoreRelation and > LogicalRelation > Also, I think there should be an option to match input data to output columns > by name. The API allows operations on tables, which hide the column > resolution problem. It's easy to copy from one table to another without > listing the columns, and in the API it is common to work with columns by name > rather than by position. I think the API should add a way to match columns by > name, which is closer to what users expect. I propose adding something like > this: > {code} > CREATE TABLE src (id: bigint, count: int, total: bigint) > CREATE TABLE dst (id: bigint, total: bigint, count: int) > sqlContext.table("src").write.byName.insertInto("dst") > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
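The positional-vs-name matching described above can be illustrated with plain Python dicts. This is a toy sketch, not Spark's resolution logic; the unused `src_cols` parameter on the positional variant is kept only for signature symmetry:

```python
def insert_by_position(src_row, src_cols, dst_cols):
    """Positional matching: values are zipped with destination
    columns, silently mis-assigning them when the orders differ."""
    return dict(zip(dst_cols, src_row))

def insert_by_name(src_row, src_cols, dst_cols):
    """Name-based matching: each value follows its column name."""
    by_name = dict(zip(src_cols, src_row))
    return {c: by_name[c] for c in dst_cols}

src_cols = ["id", "count", "total"]     # src table from the proposal
dst_cols = ["id", "total", "count"]     # dst table reorders the columns
row = (1, 10, 100)

print(insert_by_position(row, src_cols, dst_cols))  # count/total swapped
print(insert_by_name(row, src_cols, dst_cols))      # values follow names
```

With the `src`/`dst` schemas from the proposal, positional matching silently writes the `count` value into `total` and vice versa, which is exactly the hazard `byName` is meant to remove.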
[jira] [Updated] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics
[ https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28430: --- Attachment: ui-screenshot.png > Some stage table rows render wrong number of columns if tasks are missing > metrics > -- > > Key: SPARK-28430 > URL: https://issues.apache.org/jira/browse/SPARK-28430 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Attachments: ui-screenshot.png > > > The Spark UI's stages table renders too few columns for some tasks if a > subset of the tasks are missing their metrics. This is due to an > inconsistency in how we render certain columns: some columns gracefully > handle this case, but others do not. See attached screenshot below -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics
[ https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-28430: -- Assignee: Josh Rosen > Some stage table rows render wrong number of columns if tasks are missing > metrics > -- > > Key: SPARK-28430 > URL: https://issues.apache.org/jira/browse/SPARK-28430 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Attachments: ui-screenshot.png > > > The Spark UI's stages table renders too few columns for some tasks if a > subset of the tasks are missing their metrics. This is due to an > inconsistency in how we render certain columns: some columns gracefully > handle this case, but others do not. See attached screenshot below -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics
[ https://issues.apache.org/jira/browse/SPARK-28430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-28430: --- Description: The Spark UI's stages table renders too few columns for some tasks if a subset of the tasks are missing their metrics. This is due to an inconsistency in how we render certain columns: some columns gracefully handle this case, but others do not. See attached screenshot below !ui-screenshot.png! was:The Spark UI's stages table renders too few columns for some tasks if a subset of the tasks are missing their metrics. This is due to an inconsistency in how we render certain columns: some columns gracefully handle this case, but others do not. See attached screenshot below > Some stage table rows render wrong number of columns if tasks are missing > metrics > -- > > Key: SPARK-28430 > URL: https://issues.apache.org/jira/browse/SPARK-28430 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0, 3.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Attachments: ui-screenshot.png > > > The Spark UI's stages table renders too few columns for some tasks if a > subset of the tasks are missing their metrics. This is due to an > inconsistency in how we render certain columns: some columns gracefully > handle this case, but others do not. See attached screenshot below > !ui-screenshot.png! -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28430) Some stage table rows render wrong number of columns if tasks are missing metrics
Josh Rosen created SPARK-28430: -- Summary: Some stage table rows render wrong number of columns if tasks are missing metrics Key: SPARK-28430 URL: https://issues.apache.org/jira/browse/SPARK-28430 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.0, 3.0.0 Reporter: Josh Rosen The Spark UI's stages table renders too few columns for some tasks if a subset of the tasks are missing their metrics. This is due to an inconsistency in how we render certain columns: some columns gracefully handle this case, but others do not. See attached screenshot below -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift
[ https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887498#comment-16887498 ] Josh Rosen commented on SPARK-27570: [~ste...@apache.org], I finally got a chance to test your {{fadvise}} configuration recommendation and that resolved my issue. *However*, I think that there's a typo in your recommendation: this only worked when I used {{fs.s3a.experimental.*input*.fadvise}} (the {{.input}} was missing in your comment). > java.io.EOFException Reached the end of stream - Reading Parquet from Swift > --- > > Key: SPARK-27570 > URL: https://issues.apache.org/jira/browse/SPARK-27570 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Harry Hough >Priority: Major > > I did see issue SPARK-25966 but it seems there are some differences as his > problem was resolved after rebuilding the parquet files on write. This is > 100% reproducible for me across many different days of data. > I get exceptions such as "Reached the end of stream with 750477 bytes left to > read" during some read operations of parquet files. I am reading these files > from Openstack swift using openstack-hadoop 2.7.7 on Spark 2.4. > The issues seem to happen with the where statement. I have also tried filter > and combining the statements into one as well as the dataset method with > column without any luck. Which column or what the actual filter is on the > where also doesn't seem to make a difference to the error occurring or not. > > {code:java} > val engagementDS = spark > .read > .parquet(createSwiftAddr("engagements", folder)) > .where("engtype != 0") > .where("engtype != 1000") > .groupBy($"accid", $"sessionkey") > .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", > $"testid")).as("engagements")) > // Exiting paste mode, now interpreting. 
> [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception > in task 24.0 in stage 53.0 (TID 688) > java.io.EOFException: Reached the end of stream with 1323959 bytes left to > read > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127) > at > org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91) > at > org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at
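For reference, the configuration discussed in the comment above (with the corrected key, including the {{.input}} segment) is typically passed through Spark's Hadoop-configuration prefix. This is a sketch, not a definitive fix; the job script name is a placeholder:

{code:bash}
# "random" favors columnar formats like Parquet; "sequential" and "normal"
# are the other supported values for this S3A option.
spark-submit \
  --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random \
  your_job.py
{code}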
[jira] [Resolved] (SPARK-28417) Spark Submit does not use Proxy User Credentials to Resolve Path for Resources
[ https://issues.apache.org/jira/browse/SPARK-28417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-28417. Resolution: Duplicate > Spark Submit does not use Proxy User Credentials to Resolve Path for Resources > -- > > Key: SPARK-28417 > URL: https://issues.apache.org/jira/browse/SPARK-28417 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, > 2.4.0, 2.4.1, 2.4.2, 2.4.3 >Reporter: Abhishek Modi >Priority: Minor > > As of [#SPARK-21012], spark-submit supports wildcard paths (ex: > {{hdfs:///user/akmodi/*}}). To support these, spark-submit does a glob > resolution on these paths and overwrites the wildcard paths with the resolved > paths. This introduced a bug: the change did not use {{proxy-user}} > credentials when resolving these paths. As a result, Spark 2.2 and later > fail to launch an app as a {{proxy-user}} if the paths are only readable by > the {{proxy-user}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-28097: - Assignee: Seth Fitzsimmons > Map ByteType to SMALLINT when using JDBC with PostgreSQL > > > Key: SPARK-28097 > URL: https://issues.apache.org/jira/browse/SPARK-28097 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: Seth Fitzsimmons >Assignee: Seth Fitzsimmons >Priority: Minor > > PostgreSQL doesn't have {{TINYINT}}, which would map directly, but > {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when > writing). > This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28097) Map ByteType to SMALLINT when using JDBC with PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-28097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28097. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24845 [https://github.com/apache/spark/pull/24845] > Map ByteType to SMALLINT when using JDBC with PostgreSQL > > > Key: SPARK-28097 > URL: https://issues.apache.org/jira/browse/SPARK-28097 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: Seth Fitzsimmons >Assignee: Seth Fitzsimmons >Priority: Minor > Fix For: 3.0.0 > > > PostgreSQL doesn't have {{TINYINT}}, which would map directly, but > {{SMALLINT}}s are sufficient for uni-directional translation (i.e. when > writing). > This is equivalent to a user selecting {{'byteColumn.cast(ShortType)}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
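The "sufficient for uni-directional translation" claim above boils down to a range containment: every signed 8-bit value fits in a signed 16-bit column, but not the reverse. A trivial check:

```python
# Spark's ByteType is a signed 8-bit integer; PostgreSQL's SMALLINT is a
# signed 16-bit integer, so writing ByteType as SMALLINT is lossless.
BYTE_MIN, BYTE_MAX = -128, 127
SMALLINT_MIN, SMALLINT_MAX = -32768, 32767

lossless = SMALLINT_MIN <= BYTE_MIN and BYTE_MAX <= SMALLINT_MAX
print(lossless)  # True

# The reverse direction is NOT lossless, which is why the mapping is
# only safe when writing (uni-directional).
reverse_lossless = BYTE_MIN <= SMALLINT_MIN and SMALLINT_MAX <= BYTE_MAX
print(reverse_lossless)  # False
```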
[jira] [Commented] (SPARK-18829) Printing to logger
[ https://issues.apache.org/jira/browse/SPARK-18829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887450#comment-16887450 ] Alexander Tronchin-James commented on SPARK-18829: -- FWIW, the showString method on datasets is private, so it doesn't seem possible to call it except from internal Dataset methods. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L295 > Printing to logger > -- > > Key: SPARK-18829 > URL: https://issues.apache.org/jira/browse/SPARK-18829 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2 > Environment: ALL >Reporter: David Hodeffi >Priority: Trivial > Labels: easyfix, patch > Original Estimate: 1h > Remaining Estimate: 1h > > I would like to print dataframe.show or df.explain(true) output into a log file. > Right now the code prints to standard output without a way to redirect it. > It also cannot be configured in log4j.properties. > My suggestion is to write to both the logger and standard output, > i.e. > class DataFrame { > override def explain(extended: Boolean): Unit = { > val explain = ExplainCommand(queryExecution.logical, extended = extended) > sqlContext.executePlan(explain).executedPlan.executeCollect().foreach { > // scalastyle:off println > r => { > println(r.getString(0)) > logger.debug(r.getString(0)) > } > } > // scalastyle:on println > } > def show(numRows: Int, truncate: Boolean): Unit = { > val str = showString(numRows, truncate) > println(str) > logger.debug(str) > } > } -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
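Until such an API exists, one workaround is to capture what a show()/explain() call prints and forward it to a logger. A minimal sketch using only the Python standard library (a plain print stands in for the Spark call, which also writes to standard output):

```python
import io
import logging
from contextlib import redirect_stdout

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("df-show")

buf = io.StringIO()
with redirect_stdout(buf):
    # In a real job this would be df.show() or df.explain(True),
    # both of which print to standard output.
    print("+---+\n| id|\n+---+\n|  1|\n+---+")
captured = buf.getvalue()

# Forward the captured text to the logger, so it lands wherever
# log4j/logging is configured to write.
logger.debug(captured)
```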
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887313#comment-16887313 ] Cheng Su commented on SPARK-19256: -- [~aditya.dataengg] - I started the work by submitting a PR ([https://github.com/apache/spark/pull/23163]) for https://issues.apache.org/jira/browse/SPARK-26164, and it is still under review. I will ping relevant reviewers to see whether we can speed it up, thanks. > Hive bucketing support > -- > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Priority: Minor > > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28429) SQL Datetime util function being casted to double instead of timestamp
Dylan Guedes created SPARK-28429: Summary: SQL Datetime util function being casted to double instead of timestamp Key: SPARK-28429 URL: https://issues.apache.org/jira/browse/SPARK-28429 Project: Spark Issue Type: Sub-task Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dylan Guedes In the code below, the string '100 days' in now() + '100 days' is cast to double and then an error is thrown: {code:sql} CREATE TEMP VIEW v_window AS SELECT i, min(i) over (order by i range between '1 day' preceding and '10 days' following) as min_i FROM range(now(), now()+'100 days', '1 hour') i; {code} Error: {code:sql} cannot resolve '(current_timestamp() + CAST('100 days' AS DOUBLE))' due to data type mismatch: differing types in '(current_timestamp() + CAST('100 days' AS DOUBLE))' (timestamp and double).;{code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
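For comparison, spelling the offset as an interval literal keeps the addition in the timestamp domain instead of triggering the string-to-double cast. This is a sketch of the presumably intended form; whether Spark's range() accepts timestamp arguments is a separate question:

{code:sql}
-- A bare string like '100 days' is cast to DOUBLE; an interval literal
-- resolves as timestamp + interval:
SELECT now() + interval '100 days';
{code}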
[jira] [Comment Edited] (SPARK-28288) Convert and port 'window.sql' into UDF test base
[ https://issues.apache.org/jira/browse/SPARK-28288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887294#comment-16887294 ] YoungGyu Chun edited comment on SPARK-28288 at 7/17/19 6:11 PM: hi [~hyukjin.kwon], After merging SPARK-28359 I still see some errors from query 11 to query 16: {code:sql} --- a/sql/core/src/test/resources/sql-tests/results/window.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out @@ -21,10 +21,10 @@ struct<> -- !query 1 -SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT ROW) FROM testData +SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT ROW) FROM testData ORDER BY cate, val -- !query 1 schema -struct +struct -- !query 1 output NULL NULL0 3 NULL1 @@ -38,10 +38,10 @@ NULLa 0 -- !query 2 -SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val +SELECT udf(val), cate, sum(val) OVER(PARTITION BY cate ORDER BY val ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, val -- !query 2 schema -struct +struct -- !query 2 output NULL NULL3 3 NULL3 @@ -55,20 +55,27 @@ NULLa 1 -- !query 3 -SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long -ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long +SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY val_long +ROWS BETWEEN CURRENT ROW AND CAST(2147483648 AS int) FOLLOWING) FROM testData ORDER BY cate, val_long -- !query 3 schema -struct<> +struct -- !query 3 output -org.apache.spark.sql.AnalysisException -cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 41 +NULL NULL1 +1 NULL1 +1 a 2147483654 +1 a 2147483653 +2 a 2147483652 +2147483650 a 2147483650 +NULL b 2147483653 +3 b 2147483653 +2147483650 b 2147483650 -- !query 4 -SELECT val, 
cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 PRECEDING) FROM testData +SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 PRECEDING) FROM testData ORDER BY cate, val -- !query 4 schema -struct +struct -- !query 4 output NULL NULL0 3 NULL1 @@ -82,10 +89,10 @@ NULLa 0 -- !query 5 -SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val +SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val -- !query 5 schema -struct +struct -- !query 5 output NULL NULLNULL 3 NULL3 @@ -99,10 +106,10 @@ NULL a NULL -- !query 6 -SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long +SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY val_long RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long -- !query 6 schema -struct +struct -- !query 6 output NULL NULLNULL 1 NULL1 @@ -116,10 +123,10 @@ NULL b NULL -- !query 7 -SELECT val_double, cate, sum(val_double) OVER(PARTITION BY cate ORDER BY val_double +SELECT val_double, udf(cate), sum(val_double) OVER(PARTITION BY cate ORDER BY val_double RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY cate, val_double -- !query 7 schema -struct +struct -- !query 7 output NULL NULLNULL 1.0NULL1.0 @@ -133,10 +140,10 @@ NULL NULLNULL -- !query 8 -SELECT val_date, cate, max(val_date) OVER(PARTITION BY cate ORDER BY val_date +SELECT val_date, udf(cate), max(val_date) OVER(PARTITION BY cate ORDER BY val_date RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) FROM testData ORDER BY cate, val_date -- !query 8 schema -struct +struct -- !query 8 output NULL NULLNULL 2017-08-01 NULL2017-08-01 @@ -150,11 +157,11 @@ NULL NULLNULL -- !query 9 -SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp +SELECT val_timestamp, udf(cate), avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp RANGE 
BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData ORDER BY cate, val_timestamp -- !query 9 schema -struct +struct -- !query 9 output NULL NULLNULL 2017-07-31 17:00:00NULL1.5015456E9 @@ -168,10 +175,10 @@ NULL NULLNULL -- !query 10 -SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val DESC +SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val DESC RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val -- !query 10 schema -struct +struct -- !query 10 output NULL
[jira] [Created] (SPARK-28428) Spark `exclude` always expecting `()`
Dylan Guedes created SPARK-28428: Summary: Spark `exclude` always expecting `()` Key: SPARK-28428 URL: https://issues.apache.org/jira/browse/SPARK-28428 Project: Spark Issue Type: Sub-task Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Dylan Guedes SparkSQL `exclude` always expects a following call to `()`, however, PgSQL `exclude` does not. Examples: {code:sql} SELECT sum(unique1) over (rows between 2 preceding and 2 following exclude no others), unique1, four FROM tenk1 WHERE unique1 < 10; {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28427) Support more Postgres JSON functions
Josh Rosen created SPARK-28427: -- Summary: Support more Postgres JSON functions Key: SPARK-28427 URL: https://issues.apache.org/jira/browse/SPARK-28427 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Josh Rosen Postgres features a number of JSON functions that are missing in Spark: https://www.postgresql.org/docs/9.3/functions-json.html Redshift's JSON functions (https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html) have partial overlap with the Postgres list. Some of these functions can be expressed in terms of compositions of existing Spark functions. For example, I think that {{json_array_length}} can be expressed with {{cardinality}} and {{from_json}}, but there's a caveat related to legacy Hive compatibility (see the demo notebook at https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5796212617691211/45530874214710/4901752417050771/latest.html for more details). I'm filing this ticket so that we can triage the list of Postgres JSON features and decide which ones make sense to support in Spark. After we've done that, we can create individual tickets for specific functions and features. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
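The composition mentioned above might look like the following. This is a hedged sketch: {{from_json}} and {{cardinality}} are existing Spark SQL functions, and the legacy Hive-compatibility caveat from the linked notebook still applies:

{code:sql}
-- A possible stand-in for Postgres' json_array_length:
SELECT cardinality(from_json('[1, 2, 3]', 'array<int>'));
-- expected: 3
{code}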
[jira] [Updated] (SPARK-28241) Show metadata operations on ThriftServerTab
[ https://issues.apache.org/jira/browse/SPARK-28241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28241: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28426 > Show metadata operations on ThriftServerTab > --- > > Key: SPARK-28241 > URL: https://issues.apache.org/jira/browse/SPARK-28241 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > !https://user-images.githubusercontent.com/5399861/60579741-4cd2c180-9db6-11e9-822a-0433be509b67.png! -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24570) SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel SQL, DBVisualizer.etc)
[ https://issues.apache.org/jira/browse/SPARK-24570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24570: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28426 > SparkSQL - show schemas/tables in dropdowns of SQL client tools (ie Squirrel > SQL, DBVisualizer.etc) > --- > > Key: SPARK-24570 > URL: https://issues.apache.org/jira/browse/SPARK-24570 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.1 >Reporter: t oo >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: connect-to-sql-db-ssms-locate-table.png, hive.png, > spark.png > > > An end-user SQL client tool (ie in the screenshot) can list tables from > hiveserver2 and major DBs (Mysql, postgres,oracle, MSSQL..etc). But with > SparkSQL it does not display any tables. This would be very convenient for > users. > This is the exception in the client tool (Aqua Data Studio): > {code:java} > Title: An Error Occurred > Summary: Unable to Enumerate Result > Start Message > > org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '`*`' expecting STRING(line 1, pos 38) > == SQL == > SHOW TABLE EXTENDED FROM sit1_pb LIKE `*` > --^^^ > End Message > > Start Stack Trace > > java.sql.SQLException: org.apache.spark.sql.catalyst.parser.ParseException: > mismatched input '`*`' expecting STRING(line 1, pos 38) > == SQL == > SHOW TABLE EXTENDED FROM sit1_pb LIKE `*` > --^^^ > at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296) > at com.aquafold.aquacore.open.rdbms.drivers.hive.Qꐨꈬꈦꁐ.execute(Unknown > Source) > at \\.\\.\\हिñçêČάй語简�?한\\.gᚵ᠃᠍ꃰint.execute(Unknown Source) > at com.common.ui.tree.hꐊᠱꇗꇐ9int.yW(Unknown Source) > at com.common.ui.tree.hꐊᠱꇗꇐ9int$1.process(Unknown Source) > at com.common.ui.util.BackgroundThread.run(Unknown Source) > End Stack Trace > > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections doesn't show db artefacts
[ https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-24196: Issue Type: Sub-task (was: Bug) Parent: SPARK-28426 > Spark Thrift Server - SQL Client connections doesn't show db artefacts > - > > Key: SPARK-24196 > URL: https://issues.apache.org/jira/browse/SPARK-24196 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: rr >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: screenshot-1.png > > > When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are > not showing up, whereas when connecting to hiveserver2 it shows the schema, > tables, columns ... > SQL Clients used: IBM Data Studio, DBeaver -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28104) Implement Spark's own GetColumnsOperation
[ https://issues.apache.org/jira/browse/SPARK-28104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-28104: Issue Type: Sub-task (was: Improvement) Parent: SPARK-28426 > Implement Spark's own GetColumnsOperation > - > > Key: SPARK-28104 > URL: https://issues.apache.org/jira/browse/SPARK-28104 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > SPARK-24196 and SPARK-24570 implemented Spark's own {{GetSchemasOperation}} > and {{GetTablesOperation}}. We also need implement Spark's own > {{GetColumnsOperation}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28426) Metadata Handling in Thrift Server
Xiao Li created SPARK-28426: --- Summary: Metadata Handling in Thrift Server Key: SPARK-28426 URL: https://issues.apache.org/jira/browse/SPARK-28426 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li Currently, only `executeStatement` is handled for SQL commands, but others like `getTables`, `getSchemas`, `getColumns` and so on fall back to an empty in-memory Derby catalog. As a result, some BI tools cannot show the correct object information. This umbrella JIRA tracks the related Thrift Server improvements. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28425) Add more Date/Time Operators
[ https://issues.apache.org/jira/browse/SPARK-28425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28425: Description: ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| https://www.postgresql.org/docs/11/functions-datetime.html was: ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| > Add more Date/Time Operators > > > Key: SPARK-28425 > URL: https://issues.apache.org/jira/browse/SPARK-28425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Operator||Example||Result|| > |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 > 01:00:00'}}| > |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 > 23:00:00'}}| > |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 > 12:00'}}|{{interval '1 day 15:00:00'}}| > |{{*}}|{{900 * interval '1 second'}}|{{interval 
'00:15:00'}}| > |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| > |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| > |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
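The operator semantics in the table above are ordinary calendar/duration arithmetic, which can be sanity-checked against Python's `datetime`/`timedelta` (a reference model only, not Spark code):

```python
from datetime import datetime, timedelta

# date '2001-09-28' + interval '1 hour' -> timestamp '2001-09-28 01:00:00'
ts = datetime(2001, 9, 28) + timedelta(hours=1)
assert ts == datetime(2001, 9, 28, 1, 0, 0)

# timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00' -> interval '1 day 15:00:00'
delta = datetime(2001, 9, 29, 3, 0) - datetime(2001, 9, 27, 12, 0)
assert delta == timedelta(days=1, hours=15)

# 900 * interval '1 second' -> interval '00:15:00'
assert 900 * timedelta(seconds=1) == timedelta(minutes=15)

# double precision '3.5' * interval '1 hour' -> interval '03:30:00'
assert 3.5 * timedelta(hours=1) == timedelta(hours=3, minutes=30)

# interval '1 hour' / double precision '1.5' -> interval '00:40:00'
assert timedelta(hours=1) / 1.5 == timedelta(minutes=40)
```

Each assertion mirrors one row of the PostgreSQL operator table.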
[jira] [Created] (SPARK-28425) Add more Date/Time Operators
Yuming Wang created SPARK-28425: --- Summary: Add more Date/Time Operators Key: SPARK-28425 URL: https://issues.apache.org/jira/browse/SPARK-28425 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28424) Improve interval input
[ https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28424: Description: Example: {code:sql} INTERVAL '1 day 2:03:04' {code} https://www.postgresql.org/docs/11/datatype-datetime.html was: Example: {code:sql} interval '1 hour' {code} > Improve interval input > --- > > Key: SPARK-28424 > URL: https://issues.apache.org/jira/browse/SPARK-28424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Example: > {code:sql} > INTERVAL '1 day 2:03:04' > {code} > https://www.postgresql.org/docs/11/datatype-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
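As a reference for what "improve interval input" covers, the PostgreSQL-style `'1 day 2:03:04'` form combines unit phrases with a `h:m:s` clock part. A minimal sketch of such a parser (illustrative only; this is not Spark's grammar, and only a small subset of the syntax is handled):

```python
import re
from datetime import timedelta

def parse_interval(s):
    """Parse a small subset of PostgreSQL interval syntax, e.g.
    '1 day 2:03:04' or '1 hour', into a timedelta."""
    units = {"day": "days", "days": "days", "hour": "hours", "hours": "hours",
             "minute": "minutes", "minutes": "minutes",
             "second": "seconds", "seconds": "seconds"}
    kwargs = {"days": 0, "hours": 0, "minutes": 0, "seconds": 0}
    # Unit phrases such as '1 day' or '2 hours'.
    for n, unit in re.findall(r"(\d+)\s+(\w+)", s):
        kwargs[units[unit]] += int(n)
    # Optional clock part such as '2:03:04'.
    m = re.search(r"(\d+):(\d+):(\d+)", s)
    if m:
        h, mi, sec = map(int, m.groups())
        kwargs["hours"] += h
        kwargs["minutes"] += mi
        kwargs["seconds"] += sec
    return timedelta(**kwargs)

assert parse_interval("1 day 2:03:04") == timedelta(days=1, hours=2, minutes=3, seconds=4)
assert parse_interval("1 hour") == timedelta(hours=1)
```

The ticket is about accepting the combined form (`INTERVAL '1 day 2:03:04'`) in addition to the single-unit form already shown.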
[jira] [Updated] (SPARK-28424) Improve interval input
[ https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28424: Summary: Improve interval input (was: interval accept string input) > Improve interval input > --- > > Key: SPARK-28424 > URL: https://issues.apache.org/jira/browse/SPARK-28424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Example: > {code:sql} > interval '1 hour' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28424) interval accept string input
[ https://issues.apache.org/jira/browse/SPARK-28424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28424: Description: Example: {code:sql} interval '1 hour' {code} > interval accept string input > > > Key: SPARK-28424 > URL: https://issues.apache.org/jira/browse/SPARK-28424 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Example: > {code:sql} > interval '1 hour' > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28424) interval accept string input
Yuming Wang created SPARK-28424: --- Summary: interval accept string input Key: SPARK-28424 URL: https://issues.apache.org/jira/browse/SPARK-28424 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28423) merge Scan and Batch/Stream
Wenchen Fan created SPARK-28423: --- Summary: merge Scan and Batch/Stream Key: SPARK-28423 URL: https://issues.apache.org/jira/browse/SPARK-28423 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28359) Make integrated UDF tests robust by making them no-op
[ https://issues.apache.org/jira/browse/SPARK-28359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28359. - Resolution: Fixed Assignee: Hyukjin Kwon Fix Version/s: 3.0.0 > Make integrated UDF tests robust by making them no-op > - > > Key: SPARK-28359 > URL: https://issues.apache.org/jira/browse/SPARK-28359 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Current UDFs available in `IntegratedUDFTestUtils` are not exactly no-op. It > converts input column to strings and outputs to strings. > This causes many issues, for instance, > https://github.com/apache/spark/pull/25128 or > https://github.com/apache/spark/pull/25110 > Ideally we should make this UDF virtually noop. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28422: --- Description: {code:java} @pandas_udf('double', PandasUDFType.GROUPED_AGG) def max_udf(v): return v.max() df = spark.range(0, 100) df.udf.register('max_udf', max_udf) df.createTempView('table') # A. This works df.agg(max_udf(df['id'])).show() # B. This doesn't work spark.sql("select max_udf(id) from table").show(){code} Query plan: A: {code:java} == Parsed Logical Plan == 'Aggregate [max_udf('id) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Analyzed Logical Plan == max_udf(id): double Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Optimized Logical Plan == Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Physical Plan == !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] +- Exchange SinglePartition +- *(1) Range (0, 1000, step=1, splits=4) {code} B: {code:java} == Parsed Logical Plan == 'Project [unresolvedalias('max_udf('id), None)] +- 'UnresolvedRelation [table] == Analyzed Logical Plan == max_udf(id): double Project [max_udf(id#0L) AS max_udf(id)#136] +- SubqueryAlias `table` +- Range (0, 100, step=1, splits=Some(4)) == Optimized Logical Plan == Project [max_udf(id#0L) AS max_udf(id)#136] +- Range (0, 100, step=1, splits=Some(4)) == Physical Plan == *(1) Project [max_udf(id#0L) AS max_udf(id)#136] +- *(1) Range (0, 100, step=1, splits=4) {code} was: {code:java} @pandas_udf('double', PandasUDFType.GROUPED_AGG) def max_udf(v): return v.max() df = spark.range(0, 100) df.udf.register('max_udf', max_udf) df.createTempView('table') # A. This works df.agg(max_udf(df['id'])).show() # B. 
This doesn't work spark.sql("select max_udf(id) from table").show(){code} Query plan: A: {code:java} == Parsed Logical Plan == 'Aggregate [max_udf('id) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Analyzed Logical Plan == max_udf(id): double Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Optimized Logical Plan == Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Physical Plan == !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] +- Exchange SinglePartition +- *(1) Range (0, 1000, step=1, splits=4) {code} B: {code:java} == Parsed Logical Plan == 'Project [unresolvedalias('max_udf('id), None)] +- 'UnresolvedRelation [table] == Analyzed Logical Plan == max_udf(id): double Project [max_udf(id#0L) AS max_udf(id)#136] +- SubqueryAlias `table` +- Range (0, 100, step=1, splits=Some(4)) == Optimized Logical Plan == Project [max_udf(id#0L) AS max_udf(id)#136] +- Range (0, 100, step=1, splits=Some(4)) == Physical Plan == *(1) Project [max_udf(id#0L) AS max_udf(id)#136] +- *(1) Range (0, 100, step=1, splits=4) {code} Maybe related to subquery? > GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause > --- > > Key: SPARK-28422 > URL: https://issues.apache.org/jira/browse/SPARK-28422 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > > {code:java} > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def max_udf(v): > return v.max() > df = spark.range(0, 100) > df.udf.register('max_udf', max_udf) > df.createTempView('table') > # A. This works > df.agg(max_udf(df['id'])).show() > # B. 
This doesn't work > spark.sql("select max_udf(id) from table").show(){code} > > > Query plan: > A: > {code:java} > == Parsed Logical Plan == > 'Aggregate [max_udf('id) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Analyzed Logical Plan == > max_udf(id): double > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Optimized Logical Plan == > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Physical Plan == > !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] > +- Exchange SinglePartition > +- *(1) Range (0, 1000, step=1, splits=4) > {code} > B: > {code:java} > == Parsed Logical Plan == > 'Project [unresolvedalias('max_udf('id), None)] > +- 'UnresolvedRelation [table] > == Analyzed Logical Plan == > max_udf(id): double > Project [max_udf(id#0L) AS max_udf(id)#136] > +- SubqueryAlias `table` > +- Range (0, 100, step=1, splits=Some(4)) > == Optimized Logical Plan == > Project [max_udf(id#0L) AS max_udf(id)#136] > +- Range (0, 100, step=1, splits=Some(4)) > == Physical Plan == >
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28422: --- Summary: GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause (was: GROUPED_AGG pandas_udf doesn't with spark.sql without group by clause) > GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause > --- > > Key: SPARK-28422 > URL: https://issues.apache.org/jira/browse/SPARK-28422 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > > {code:java} > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def max_udf(v): > return v.max() > df = spark.range(0, 100) > df.udf.register('max_udf', max_udf) > df.createTempView('table') > # A. This works > df.agg(max_udf(df['id'])).show() > # B. This doesn't work > spark.sql("select max_udf(id) from table"){code} > > > Query plan: > A: > {code:java} > == Parsed Logical Plan == > 'Aggregate [max_udf('id) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Analyzed Logical Plan == > max_udf(id): double > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Optimized Logical Plan == > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Physical Plan == > !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] > +- Exchange SinglePartition > +- *(1) Range (0, 1000, step=1, splits=4) > {code} > B: > {code:java} > == Parsed Logical Plan == > 'Project [unresolvedalias('max_udf('id), None)] > +- 'UnresolvedRelation [table] > == Analyzed Logical Plan == > max_udf(id): double > Project [max_udf(id#0L) AS max_udf(id)#136] > +- SubqueryAlias `table` > +- Range (0, 100, step=1, splits=Some(4)) > == Optimized Logical Plan == > Project [max_udf(id#0L) AS max_udf(id)#136] > +- Range (0, 100, step=1, splits=Some(4)) > == Physical Plan == > *(1) Project [max_udf(id#0L) 
AS max_udf(id)#136] > +- *(1) Range (0, 100, step=1, splits=4) > {code} > Maybe related to subquery? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
[ https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-28422: --- Description: {code:java} @pandas_udf('double', PandasUDFType.GROUPED_AGG) def max_udf(v): return v.max() df = spark.range(0, 100) df.udf.register('max_udf', max_udf) df.createTempView('table') # A. This works df.agg(max_udf(df['id'])).show() # B. This doesn't work spark.sql("select max_udf(id) from table").show(){code} Query plan: A: {code:java} == Parsed Logical Plan == 'Aggregate [max_udf('id) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Analyzed Logical Plan == max_udf(id): double Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Optimized Logical Plan == Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Physical Plan == !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] +- Exchange SinglePartition +- *(1) Range (0, 1000, step=1, splits=4) {code} B: {code:java} == Parsed Logical Plan == 'Project [unresolvedalias('max_udf('id), None)] +- 'UnresolvedRelation [table] == Analyzed Logical Plan == max_udf(id): double Project [max_udf(id#0L) AS max_udf(id)#136] +- SubqueryAlias `table` +- Range (0, 100, step=1, splits=Some(4)) == Optimized Logical Plan == Project [max_udf(id#0L) AS max_udf(id)#136] +- Range (0, 100, step=1, splits=Some(4)) == Physical Plan == *(1) Project [max_udf(id#0L) AS max_udf(id)#136] +- *(1) Range (0, 100, step=1, splits=4) {code} Maybe related to subquery? was: {code:java} @pandas_udf('double', PandasUDFType.GROUPED_AGG) def max_udf(v): return v.max() df = spark.range(0, 100) df.udf.register('max_udf', max_udf) df.createTempView('table') # A. This works df.agg(max_udf(df['id'])).show() # B. 
This doesn't work spark.sql("select max_udf(id) from table"){code} Query plan: A: {code:java} == Parsed Logical Plan == 'Aggregate [max_udf('id) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Analyzed Logical Plan == max_udf(id): double Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Optimized Logical Plan == Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Physical Plan == !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] +- Exchange SinglePartition +- *(1) Range (0, 1000, step=1, splits=4) {code} B: {code:java} == Parsed Logical Plan == 'Project [unresolvedalias('max_udf('id), None)] +- 'UnresolvedRelation [table] == Analyzed Logical Plan == max_udf(id): double Project [max_udf(id#0L) AS max_udf(id)#136] +- SubqueryAlias `table` +- Range (0, 100, step=1, splits=Some(4)) == Optimized Logical Plan == Project [max_udf(id#0L) AS max_udf(id)#136] +- Range (0, 100, step=1, splits=Some(4)) == Physical Plan == *(1) Project [max_udf(id#0L) AS max_udf(id)#136] +- *(1) Range (0, 100, step=1, splits=4) {code} Maybe related to subquery? > GROUPED_AGG pandas_udf doesn't with spark.sql() without group by clause > --- > > Key: SPARK-28422 > URL: https://issues.apache.org/jira/browse/SPARK-28422 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.3 >Reporter: Li Jin >Priority: Major > > > {code:java} > @pandas_udf('double', PandasUDFType.GROUPED_AGG) > def max_udf(v): > return v.max() > df = spark.range(0, 100) > df.udf.register('max_udf', max_udf) > df.createTempView('table') > # A. This works > df.agg(max_udf(df['id'])).show() > # B. 
This doesn't work > spark.sql("select max_udf(id) from table").show(){code} > > > Query plan: > A: > {code:java} > == Parsed Logical Plan == > 'Aggregate [max_udf('id) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Analyzed Logical Plan == > max_udf(id): double > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Optimized Logical Plan == > Aggregate [max_udf(id#64L) AS max_udf(id)#140] > +- Range (0, 1000, step=1, splits=Some(4)) > == Physical Plan == > !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] > +- Exchange SinglePartition > +- *(1) Range (0, 1000, step=1, splits=4) > {code} > B: > {code:java} > == Parsed Logical Plan == > 'Project [unresolvedalias('max_udf('id), None)] > +- 'UnresolvedRelation [table] > == Analyzed Logical Plan == > max_udf(id): double > Project [max_udf(id#0L) AS max_udf(id)#136] > +- SubqueryAlias `table` > +- Range (0, 100, step=1, splits=Some(4)) > == Optimized Logical Plan == > Project [max_udf(id#0L) AS max_udf(id)#136] > +- Range (0, 100, step=1, splits=Some(4)) > ==
[jira] [Created] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql without group by clause
Li Jin created SPARK-28422: -- Summary: GROUPED_AGG pandas_udf doesn't with spark.sql without group by clause Key: SPARK-28422 URL: https://issues.apache.org/jira/browse/SPARK-28422 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.4.3 Reporter: Li Jin {code:java} @pandas_udf('double', PandasUDFType.GROUPED_AGG) def max_udf(v): return v.max() df = spark.range(0, 100) df.udf.register('max_udf', max_udf) df.createTempView('table') # A. This works df.agg(max_udf(df['id'])).show() # B. This doesn't work spark.sql("select max_udf(id) from table"){code} Query plan: A: {code:java} == Parsed Logical Plan == 'Aggregate [max_udf('id) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Analyzed Logical Plan == max_udf(id): double Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Optimized Logical Plan == Aggregate [max_udf(id#64L) AS max_udf(id)#140] +- Range (0, 1000, step=1, splits=Some(4)) == Physical Plan == !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140] +- Exchange SinglePartition +- *(1) Range (0, 1000, step=1, splits=4) {code} B: {code:java} == Parsed Logical Plan == 'Project [unresolvedalias('max_udf('id), None)] +- 'UnresolvedRelation [table] == Analyzed Logical Plan == max_udf(id): double Project [max_udf(id#0L) AS max_udf(id)#136] +- SubqueryAlias `table` +- Range (0, 100, step=1, splits=Some(4)) == Optimized Logical Plan == Project [max_udf(id#0L) AS max_udf(id)#136] +- Range (0, 100, step=1, splits=Some(4)) == Physical Plan == *(1) Project [max_udf(id#0L) AS max_udf(id)#136] +- *(1) Range (0, 100, step=1, splits=4) {code} Maybe related to subquery? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
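Independently of the planner behaviour above (case B is analyzed as a `Project` rather than an `Aggregate`), the intended semantics of an aggregate without a GROUP BY are clear: the whole relation forms a single group. A pure-Python model of that contract (no Spark required; names here are illustrative):

```python
def grouped_agg(rows, key_fn, agg_fn):
    """Apply agg_fn to each group of rows; with a constant grouping key,
    the whole input forms a single group -- what a GROUPED_AGG UDF
    without a GROUP BY clause should reduce to."""
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)
    return {k: agg_fn(v) for k, v in groups.items()}

ids = list(range(100))  # mirrors spark.range(0, 100)

# With a grouping key: one aggregate value per group.
per_parity = grouped_agg(ids, lambda i: i % 2, max)
assert per_parity == {0: 98, 1: 99}

# Without a GROUP BY clause: a single global group, a single value.
whole = grouped_agg(ids, lambda i: None, max)
assert whole == {None: 99}
```

In plan terms, case B should produce an `AggregateInPandas` over a single partition, as case A does, rather than a plain `Project`.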
[jira] [Resolved] (SPARK-24283) Make standard scaler work without legacy MLlib
[ https://issues.apache.org/jira/browse/SPARK-24283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-24283. --- Resolution: Duplicate > Make standard scaler work without legacy MLlib > -- > > Key: SPARK-24283 > URL: https://issues.apache.org/jira/browse/SPARK-24283 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0 >Reporter: holdenk >Priority: Trivial > Labels: starter > > Currently StandardScaler converts Spark ML vectors to MLlib vectors during > prediction, we should skip that step. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28246) State of UDAF: buffer is not cleared
[ https://issues.apache.org/jira/browse/SPARK-28246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887005#comment-16887005 ] Hyukjin Kwon commented on SPARK-28246: -- It's implementation details. It's documented as below: {code} * The contract should be that applying the merge function on two initial buffers should just * return the initial buffer itself, i.e. * `merge(initialBuffer, initialBuffer)` should equal `initialBuffer`. {code} If we don't do anything within the initialization, it won't meet this contract. > State of UDAF: buffer is not cleared > > > Key: SPARK-28246 > URL: https://issues.apache.org/jira/browse/SPARK-28246 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 > Environment: Ubuntu Linux 16.04 > Reproducible with option --master local[1] > {code:java} > $ spark-shell --master local[1] > {code} >Reporter: Pavel Parkhomenko >Priority: Major > > Buffer object for UserDefinedAggregateFunction contains data from previous > iteration. 
For example, > {code:java} > import org.apache.spark.sql.Row > import org.apache.spark.sql.expressions.{MutableAggregationBuffer, > UserDefinedAggregateFunction} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions.callUDF > import java.util.Arrays.asList > val df = spark.createDataFrame( > asList( > Row(1, "a"), > Row(2, "b")), > StructType(List( > StructField("id", IntegerType), > StructField("value", StringType > trait Min extends UserDefinedAggregateFunction { > override val inputSchema: StructType = > StructType(Array(StructField("value", StringType))) > override val bufferSchema: StructType = StructType(Array(StructField("min", > StringType))) > override def dataType: DataType = StringType > override def deterministic: Boolean = true > override def update(buffer: MutableAggregationBuffer, input: Row): Unit = > if (input(0) != null && (buffer(0) == null || buffer.getString(0) > > input.getString(0))) buffer(0) = input(0) > override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = > update(buffer1, buffer2) > override def evaluate(buffer: Row): Any = buffer(0) > } > class GoodMin extends Min { > override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) > = None > } > class BadMin extends Min { > override def initialize(buffer: MutableAggregationBuffer): Unit = {} > } > spark.udf.register("goodmin", new GoodMin) > spark.udf.register("badmin", new BadMin) > df groupBy "id" agg callUDF("goodmin", $"value") show false > df groupBy "id" agg callUDF("badmin", $"value") show false > {code} > Output is > {noformat} > scala> df groupBy "id" agg callUDF("goodmin", $"value") show false > +---+--+ > |id |goodmin(value)| > +---+--+ > |1 |a | > |2 |b | > +---+--+ > scala> df groupBy "id" agg callUDF("badmin", $"value") show false > +---+-+ > |id |badmin(value)| > +---+-+ > |1 |a | > |2 |a | > +---+-+ > {noformat} > The difference between GoodMin and BadMin is a buffer initialization. 
> *This example could be reproduced with a single worker thread only*. To > reproduce it is mandatory to run spark shell with option > {code:java} > spark-shell --master local[1] > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
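The observed behaviour is consistent with the executor reusing a single mutable aggregation buffer across groups on one partition: if `initialize` does not reset it, state from the previous group leaks into the next. A pure-Python model of that reuse (a sketch of the mechanism, not Spark's actual code):

```python
def aggregate_by_group(groups, initialize, update):
    """Model of a partition-local aggregation that reuses ONE mutable
    buffer across groups, as a single-threaded executor may do."""
    buffer = [None]          # stand-in for MutableAggregationBuffer
    results = []
    for group in groups:
        initialize(buffer)   # a correct UDAF must reset the buffer here
        for value in group:
            update(buffer, value)
        results.append(buffer[0])
    return results

def update_min(buf, v):
    # Same logic as the Min.update in the report.
    if buf[0] is None or buf[0] > v:
        buf[0] = v

# GoodMin resets the buffer; BadMin leaves it untouched.
good = aggregate_by_group([["a"], ["b"]], lambda b: b.__setitem__(0, None), update_min)
bad  = aggregate_by_group([["a"], ["b"]], lambda b: None, update_min)

assert good == ["a", "b"]   # buffer reset per group
assert bad  == ["a", "a"]   # "a" from group 1 leaks into group 2
```

This is why the contract quoted above effectively requires `initialize` to put the buffer into a true identity state rather than doing nothing.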
[jira] [Resolved] (SPARK-27027) from_avro function does not deserialize the Avro record of a struct column type correctly
[ https://issues.apache.org/jira/browse/SPARK-27027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27027. -- Resolution: Duplicate Seems a duplicate of SPARK-27798 > from_avro function does not deserialize the Avro record of a struct column > type correctly > - > > Key: SPARK-27027 > URL: https://issues.apache.org/jira/browse/SPARK-27027 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Hien Luu >Priority: Minor > > {{from_avro}} function produces wrong output of a struct field. See the > output at the bottom of the description > {code} > import org.apache.spark.sql.types._ > import org.apache.spark.sql.avro._ > import org.apache.spark.sql.functions._ > spark.version > val df = Seq((1, "John Doe", 30), (2, "Mary Jane", 25), (3, "Josh Duke", > 50)).toDF("id", "name", "age") > val dfStruct = df.withColumn("value", struct("name","age")) > dfStruct.show > dfStruct.printSchema > val dfKV = dfStruct.select(to_avro('id).as("key"), > to_avro('value).as("value")) > val expectedSchema = StructType(Seq(StructField("name", StringType, > true),StructField("age", IntegerType, false))) > val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString > val avroTypeStr = s""" > |{ > | "type": "int", > | "name": "key" > |} > """.stripMargin > dfKV.select(from_avro('key, avroTypeStr)).show > dfKV.select(from_avro('value, avroTypeStruct)).show > // output for the last statement and that is not correct > +-+ > |from_avro(value, struct)| > +-+ > | [Josh Duke, 50]| > | [Josh Duke, 50]| > | [Josh Duke, 50]| > +-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27820) case insensitive resolver should be used in GetMapValue
[ https://issues.apache.org/jira/browse/SPARK-27820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-27820. -- Resolution: Won't Fix > case insensitive resolver should be used in GetMapValue > --- > > Key: SPARK-27820 > URL: https://issues.apache.org/jira/browse/SPARK-27820 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.1 >Reporter: Michel Lemay >Priority: Minor > > When extracting a key value from a MapType, it calls GetMapValue > (complexTypeExtractors.scala) and only use the map type ordering. It should > use the resolver instead. > Starting spark with: `{{spark-shell --conf spark.sql.caseSensitive=false`}} > Given dataframe: > {{val df = List(Map("a" -> 1), Map("A" -> 2)).toDF("m")}} > And executing any of these will only return one row: case insensitive in the > name of the column but case sensitive match in the keys of the map. > {{df.filter($"M.A".isNotNull).count}} > {{df.filter($"M"("A").isNotNull).count > df.filter($"M".getField("A").isNotNull).count}} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
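What the reporter asks for can be modelled as applying the analyzer's resolver to map keys as well as to column names. A sketch of the two behaviours (illustrative only; this is not Spark's `GetMapValue` implementation):

```python
def get_map_value(m, key, case_sensitive=False):
    """Look up `key` in dict `m` using a resolver, mirroring how
    GetMapValue could honour spark.sql.caseSensitive for map keys."""
    if case_sensitive:
        resolver = lambda a, b: a == b
    else:
        resolver = lambda a, b: a.lower() == b.lower()
    for k, v in m.items():
        if resolver(k, key):
            return v
    return None

rows = [{"a": 1}, {"A": 2}]

# Current behaviour: key matching is always case sensitive -> one row matches.
assert sum(get_map_value(m, "A", case_sensitive=True) is not None for m in rows) == 1

# With the resolver applied to keys, both rows match under caseSensitive=false.
assert sum(get_map_value(m, "A", case_sensitive=False) is not None for m in rows) == 2
```

This mirrors the reported surprise: the column name `M` resolves case-insensitively, but the key `A` inside the map does not.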
[jira] [Created] (SPARK-28421) SparseVector.apply performance optimization
zhengruifeng created SPARK-28421: Summary: SparseVector.apply performance optimization Key: SPARK-28421 URL: https://issues.apache.org/jira/browse/SPARK-28421 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0.0 Reporter: zhengruifeng The current implementation of SparseVector.apply is inefficient: on each call, a breeze.linalg.SparseVector and a breeze.collection.mutable.SparseArray are created internally, and only then is binary search used to find the input position. This should be optimized like .ml.SparseMatrix, which uses binary search directly, without conversion to breeze.linalg.Matrix. I tested the performance and found that avoiding the internal conversions yields a 2.5~5X speedup. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
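The proposed approach amounts to a direct binary search over the sorted index array, with no intermediate Breeze objects. A minimal Python sketch of that lookup (the CSR-style `(size, indices, values)` layout matches how an ML SparseVector is stored; the function name is illustrative):

```python
import bisect

def sparse_apply(size, indices, values, i):
    """Return element i of a sparse vector stored as sorted `indices`
    plus parallel `values`, via direct binary search -- the approach
    the ticket proposes, avoiding conversion to Breeze structures."""
    if not 0 <= i < size:
        raise IndexError(i)
    j = bisect.bisect_left(indices, i)
    if j < len(indices) and indices[j] == i:
        return values[j]
    return 0.0  # position i holds an implicit zero

# Vector of size 6 with non-zeros at positions 1 and 4.
assert sparse_apply(6, [1, 4], [3.0, 5.0], 4) == 5.0
assert sparse_apply(6, [1, 4], [3.0, 5.0], 2) == 0.0
```

With sorted indices this is O(log nnz) per lookup and allocates nothing, which is where the reported 2.5~5X speedup would come from.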
[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886913#comment-16886913 ] Gabor Somogyi commented on SPARK-28415: --- Kafka 0.8 support is deprecated as of Spark 2.3.0. > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
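For context, the early projection that messageHandler enabled can be sketched in plain Python (illustrative only; the field names and records are hypothetical, and this is not Spark code):

```python
import json

def message_handler(record_value):
    # Parse each Kafka record as soon as it arrives and keep only
    # the handful of fields the job needs, discarding the rest of
    # a potentially huge JSON payload before it flows downstream.
    doc = json.loads(record_value)
    return {"id": doc["id"], "ts": doc["ts"]}

raw = ['{"id": 1, "ts": 100, "payload": "..."}',
       '{"id": 2, "ts": 200, "payload": "..."}']
projected = [message_handler(v) for v in raw]
print(projected[0])  # {'id': 1, 'ts': 100}
```

Applying such a handler per record at the source, rather than after the stream is materialized, is the behaviour the reporter wants preserved in the 0.10 API.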
[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886906#comment-16886906 ] Michael Spector commented on SPARK-28415: - [~gsomogyi] Can you say whether the Kafka 0.8 API will be supported forever, or whether it will be deprecated at some point? If the latter, then this basic functionality should be preserved even though the API is different. > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886905#comment-16886905 ] Gabor Somogyi commented on SPARK-28415: --- I don't see why a different API should behave the same way. > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28420) Date/Time Functions: date_part
Yuming Wang created SPARK-28420: --- Summary: Date/Time Functions: date_part Key: SPARK-28420 URL: https://issues.apache.org/jira/browse/SPARK-28420 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang ||Function||Return Type||Description||Example||Result|| |{{date_part(}}{{text}}{{, }}{{timestamp}}{{)}}|{{double precision}}|Get subfield (equivalent to {{extract}}); see [Section 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('hour', timestamp '2001-02-16 20:38:40')}}|{{20}}| |{{date_part(}}{{text}}{{, }}{{interval}}{{)}}|{{double precision}}|Get subfield (equivalent to {{extract}}); see [Section 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('month', interval '2 years 3 months')}}|{{3}}| We can replace it with {{extract(field from timestamp)}}. https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
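The equivalence relied on here - date_part is extract with the field as a string argument - can be illustrated with a small plain-Python sketch (standard-library datetime, not Spark; the helper name is hypothetical):

```python
from datetime import datetime

def date_part(field, ts):
    # Minimal stand-in for SQL date_part / extract on timestamps:
    # dispatch the field name to the corresponding component.
    parts = {
        "year": ts.year, "month": ts.month, "day": ts.day,
        "hour": ts.hour, "minute": ts.minute, "second": ts.second,
    }
    # date_part returns double precision, hence the float.
    return float(parts[field])

ts = datetime(2001, 2, 16, 20, 38, 40)
print(date_part("hour", ts))  # 20.0
```

In SQL terms, `date_part('hour', ts)` and `extract(hour from ts)` name the same subfield lookup, which is why the ticket proposes mapping one onto the other.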
[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886900#comment-16886900 ] Michael Spector commented on SPARK-28415: - Well, it depends on your perspective on the issue. If it's a regression (and from our perspective it is, since we're unable to upgrade to the new API), then it's a bug. > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-28415: -- Issue Type: New Feature (was: Bug) > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886898#comment-16886898 ] Gabor Somogyi commented on SPARK-28415: --- This is more of a new feature than a bug, so I've updated the issue type. > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API
[ https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-28415: -- Component/s: (was: Structured Streaming) DStreams > Add messageHandler to Kafka 10 direct stream API > > > Key: SPARK-28415 > URL: https://issues.apache.org/jira/browse/SPARK-28415 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.3 >Reporter: Michael Spector >Priority: Major > > Lack of a messageHandler parameter in KafkaUtils.createDirectStream(...) in the > new Kafka API is what prevents us from upgrading our processes to use it, and > here's why: > # messageHandler() allowed parsing / filtering / projecting huge JSON files > at an early stage (only a small subset of JSON fields is required for a > process); without this, the current cluster configuration doesn't keep up with the > traffic. > # Transforming Kafka events right after a stream is created prevents using the > HasOffsetRanges interface later. This means that the whole message must be > propagated to the end of the pipeline, which is very inefficient. > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28419) A patch for SparkThriftServer to support multi-tenant authentication
[ https://issues.apache.org/jira/browse/SPARK-28419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-28419: Affects Version/s: (was: 2.4.0) 3.0.0 > A patch for SparkThriftServer to support multi-tenant authentication > - > > Key: SPARK-28419 > URL: https://issues.apache.org/jira/browse/SPARK-28419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28419) A patch for SparkThriftServer to support multi-tenant authentication
angerszhu created SPARK-28419: - Summary: A patch for SparkThriftServer to support multi-tenant authentication Key: SPARK-28419 URL: https://issues.apache.org/jira/browse/SPARK-28419 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: angerszhu -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28363) Enable running tests with Clover
[ https://issues.apache.org/jira/browse/SPARK-28363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28363. -- Resolution: Incomplete No feedback. > Enable running tests with Clover > --- > > Key: SPARK-28363 > URL: https://issues.apache.org/jira/browse/SPARK-28363 > Project: Spark > Issue Type: Task > Components: Build, Tests >Affects Versions: 2.3.0 >Reporter: luhuachao >Priority: Major > > Currently, a compilation error occurs when running tests with Clover, because of > the Java-Scala cross-compilation in Spark; refer to > [https://confluence.atlassian.com/cloverkb/java-scala-cross-compilation-error-cannot-find-symbol-765593874.html]. > Do we need to modify the pom.xml to support Clover in Spark? > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28364) Unable to read complete data from an external hive table stored as ORC that points to a managed table's data files, which are stored in sub-directories.
[ https://issues.apache.org/jira/browse/SPARK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28364. -- Resolution: Incomplete No feedback. > Unable to read complete data from an external hive table stored as ORC that > points to a managed table's data files, which are stored in > sub-directories. > -- > > Key: SPARK-28364 > URL: https://issues.apache.org/jira/browse/SPARK-28364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Debdut Mukherjee >Priority: Major > Attachments: pic.PNG > > Unable to read complete data from an external hive table stored as ORC that > points to a managed table's data files (ORC), which are stored in > sub-directories. > The count also does not match unless the path is given with a *. > *Example - this works:* > "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/*" > But the above creates a blank directory named ' * ' in the ADLS (Azure Data Lake > Store) location. > > The one below does not work when a SELECT COUNT ( * ) is executed on this > external table: it gives a partial count. > {code} > CREATE EXTERNAL TABLE IF NOT EXISTS db1.tbl1 ( > Col_1 string, > Col_2 string > ) > STORED AS ORC > LOCATION "adl://.azuredatalakestore.net/clusters/ path>/hive/warehouse/db2.db/tbl1/" > {code} > > I searched Google for a resolution, and even adding the lines below to the > Databricks notebook did not solve the problem. > {code} > sqlContext.setConf("mapred.input.dir.recursive","true"); > sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true"); > {code} > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27884) Deprecate Python 2 support in Spark 3.0
[ https://issues.apache.org/jira/browse/SPARK-27884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886796#comment-16886796 ] xifeng commented on SPARK-27884: looks fine > Deprecate Python 2 support in Spark 3.0 > --- > > Key: SPARK-27884 > URL: https://issues.apache.org/jira/browse/SPARK-27884 > Project: Spark > Issue Type: Story > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: release-notes > > Officially deprecate Python 2 support in Spark 3.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886768#comment-16886768 ] Aditya Prakash commented on SPARK-19256: [~chengsu] any update on this? > Hive bucketing support > -- > > Key: SPARK-19256 > URL: https://issues.apache.org/jira/browse/SPARK-19256 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Priority: Minor > > JIRA to track design discussions and tasks related to Hive bucketing support > in Spark. > Proposal : > https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28418) Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect
Hyukjin Kwon created SPARK-28418: Summary: Flaky Test: pyspark.sql.tests.test_dataframe: test_query_execution_listener_on_collect Key: SPARK-28418 URL: https://issues.apache.org/jira/browse/SPARK-28418 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon {code} ERROR [0.164s]: test_query_execution_listener_on_collect (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests) -- Traceback (most recent call last): File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, in test_query_execution_listener_on_collect "The callback from the query execution listener should be called after 'collect'") AssertionError: The callback from the query execution listener should be called after 'collect' {code} It seems the test can fail because it does not wait for the listener events to be processed. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
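One common way to deflake a test like this is to poll for the asynchronous callback with a timeout instead of asserting immediately after the action returns; a generic Python sketch (illustrative only, not the actual PySpark fix):

```python
import threading
import time

def wait_until(predicate, timeout=10.0, interval=0.05):
    """Poll until predicate() is true or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()

# Simulated listener: a flag set from another thread after a delay,
# like a query-execution listener firing some time after collect().
called = threading.Event()
threading.Timer(0.2, called.set).start()

assert wait_until(called.is_set), \
    "listener callback should be observed within the timeout"
print("callback observed")
```

Asserting only after the polling loop succeeds (or times out) removes the race between the test thread and the listener-event delivery thread.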