[jira] [Resolved] (SPARK-28994) Document working of Adaptive

2020-03-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28994.
--
Resolution: Duplicate

> Document working of Adaptive
> 
>
> Key: SPARK-28994
> URL: https://issues.apache.org/jira/browse/SPARK-28994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28994) Document working of Adaptive

2020-03-08 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054333#comment-17054333
 ] 

Takeshi Yamamuro commented on SPARK-28994:
--

I closed this because I think it duplicates the other ticket. Please reopen it 
if there is any problem.

> Document working of Adaptive
> 
>
> Key: SPARK-28994
> URL: https://issues.apache.org/jira/browse/SPARK-28994
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28993) Document Working of Bucketing

2020-03-08 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054335#comment-17054335
 ] 

Takeshi Yamamuro commented on SPARK-28993:
--

I'll close this because it is inactive, and I think this topic should be covered 
in the SQL tuning guide rather than the SQL reference.

> Document Working of Bucketing
> -
>
> Key: SPARK-28993
> URL: https://issues.apache.org/jira/browse/SPARK-28993
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28993) Document Working of Bucketing

2020-03-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28993.
--
Resolution: Invalid

> Document Working of Bucketing
> -
>
> Key: SPARK-28993
> URL: https://issues.apache.org/jira/browse/SPARK-28993
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31081) Make SQLMetrics more readable from UI

2020-03-08 Thread wuyi (Jira)
wuyi created SPARK-31081:


 Summary: Make SQLMetrics more readable from UI
 Key: SPARK-31081
 URL: https://issues.apache.org/jira/browse/SPARK-31081
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: wuyi


The metrics became harder to read after SPARK-30209, and users may not be interested 
in the extra info ({{stageId/stageAttemptId/taskId}}) when they are not debugging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2020-03-08 Thread Sunayan Saikia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054482#comment-17054482
 ] 

Sunayan Saikia commented on SPARK-20427:


It seems this fix broke the way we could get the column name via the _'name'_ 
key of the MetadataBuilder map inside getCatalystType().
Is there a way to get the column name now while overriding the 
getCatalystType() method?

Please check the Java code below for what broke:
public Option<DataType> getCatalystType(int sqlJdbcType, String typeName, int 
size, MetadataBuilder md) {
  String columnName = String.valueOf(md.getMap().get("name").get());
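
For context, a self-contained sketch of the kind of custom dialect override being discussed; the dialect class name, the URL prefix, and the NUMERIC-to-decimal mapping below are hypothetical illustrations, not part of this ticket:

{code:java}
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.MetadataBuilder;
import scala.Option;

public class MyOracleDialect extends JdbcDialect {
  @Override
  public boolean canHandle(String url) {
    return url.startsWith("jdbc:oracle");
  }

  @Override
  public Option<DataType> getCatalystType(int sqlType, String typeName, int size,
                                          MetadataBuilder md) {
    // Before this fix, md.getMap().get("name") carried the column name; per the
    // comment above, that key is no longer available here.
    if (sqlType == java.sql.Types.NUMERIC) {
      // Map Oracle NUMBER to a bounded decimal so precision stays within Spark's limit of 38.
      return Option.apply((DataType) DataTypes.createDecimalType(38, 10));
    }
    return Option.empty();
  }
}
{code}

Such a dialect is registered with JdbcDialects.registerDialect(new MyOracleDialect()) before reading from JDBC.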

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> Oracle has a data type NUMBER. When defining a field of type NUMBER in a table, 
> the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame read from a table in one schema 
> could not be inserted into the identical table in another schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31053) mark connector API as Evolving

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31053:
-

Assignee: Wenchen Fan

> mark connector API as Evolving
> --
>
> Key: SPARK-31053
> URL: https://issues.apache.org/jira/browse/SPARK-31053
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31053) mark connector API as Evolving

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31053.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27811
[https://github.com/apache/spark/pull/27811]

> mark connector API as Evolving
> --
>
> Key: SPARK-31053
> URL: https://issues.apache.org/jira/browse/SPARK-31053
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31080) Bugs/missing functions in documents

2020-03-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054515#comment-17054515
 ] 

L. C. Hsieh commented on SPARK-31080:
-

There is an article explaining Pivot:

https://databricks.com/blog/2018/11/01/sql-pivot-converting-rows-to-columns.html

Maybe it can be helpful for you.

> Bugs/missing functions in documents
> ---
>
> Key: SPARK-31080
> URL: https://issues.apache.org/jira/browse/SPARK-31080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Viet
>Priority: Minor
>
> In the current documentation for the SQL API, I noticed that there is no section 
> for the `PIVOT` keyword, which was introduced in 2.4.0.
> Is there a bug in `mkdocs`? 
> Docs: [https://spark.apache.org/docs/latest/api/sql/]
> P/S: Not sure if this issue should be here, but I could not find any other 
> place to put it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31080) Bugs/missing functions in documents

2020-03-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054516#comment-17054516
 ] 

L. C. Hsieh commented on SPARK-31080:
-

Since this is not a bug, I will close it. You can ask questions on the Spark 
mailing lists.

> Bugs/missing functions in documents
> ---
>
> Key: SPARK-31080
> URL: https://issues.apache.org/jira/browse/SPARK-31080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Viet
>Priority: Minor
>
> In the current documentation for the SQL API, I noticed that there is no section 
> for the `PIVOT` keyword, which was introduced in 2.4.0.
> Is there a bug in `mkdocs`? 
> Docs: [https://spark.apache.org/docs/latest/api/sql/]
> P/S: Not sure if this issue should be here, but I could not find any other 
> place to put it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31080) Bugs/missing functions in documents

2020-03-08 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-31080.
-
Resolution: Not A Bug

> Bugs/missing functions in documents
> ---
>
> Key: SPARK-31080
> URL: https://issues.apache.org/jira/browse/SPARK-31080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Viet
>Priority: Minor
>
> In the current documentation for the SQL API, I noticed that there is no section 
> for the `PIVOT` keyword, which was introduced in 2.4.0.
> Is there a bug in `mkdocs`? 
> Docs: [https://spark.apache.org/docs/latest/api/sql/]
> P/S: Not sure if this issue should be here, but I could not find any other 
> place to put it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31071) Spark Encoders.bean() should allow marking non-null fields in its Spark schema

2020-03-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054520#comment-17054520
 ] 

L. C. Hsieh commented on SPARK-31071:
-

Which Nonnull annotation should we use? There seems to be no standard Nonnull 
annotation at the moment.

> Spark Encoders.bean() should allow marking non-null fields in its Spark schema
> --
>
> Key: SPARK-31071
> URL: https://issues.apache.org/jira/browse/SPARK-31071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kyrill Alyoshin
>Priority: Major
>
> The Spark _Encoders.bean()_ method should allow the generated StructType schema 
> fields to be *non-nullable*.
> Currently, any non-primitive type is automatically _nullable_. This is 
> hard-coded in the _org.apache.spark.sql.catalyst.JavaTypeInference_ class.  
> This can lead to rather interesting situations... For example, let's say I 
> want to save a dataframe using the Avro format with my own non-Spark-generated 
> Avro schema. Let's also say that my Avro schema has a field that is non-null 
> (i.e., not a union type). Well, it appears *impossible* to store a dataframe 
> using such an Avro schema, since Spark would assume that the field is nullable 
> (as it is in its own schema), which conflicts with the Avro schema semantics 
> and throws an exception.
> I propose making a change to the _JavaTypeInference_ class to observe the 
> JSR-305 _Nonnull_ annotation (and its children) on the provided bean class 
> during StructType schema generation. This would allow bean creators to 
> control the resulting Spark schema so much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31071) Spark Encoders.bean() should allow marking non-null fields in its Spark schema

2020-03-08 Thread Kyrill Alyoshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054521#comment-17054521
 ] 

Kyrill Alyoshin commented on SPARK-31071:
-

_javax.annotation.Nonnull_ seems like a good choice. You already ship 
jsr305-1.3.9.jar with the Spark distribution (I am using 2.4.4), so this would not 
even add a new dependency.
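
To make the proposal concrete, here is a rough sketch; the bean class and field names are made up for illustration, and with the current JavaTypeInference both fields still come out nullable, so the annotation would only take effect if the proposed change were made:

{code:java}
import java.io.Serializable;
import javax.annotation.Nonnull;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public class Event implements Serializable {
  @Nonnull
  private String id;    // under the proposal this field would become non-nullable in the schema
  private String note;  // unannotated, stays nullable

  public String getId() { return id; }
  public void setId(String id) { this.id = id; }
  public String getNote() { return note; }
  public void setNote(String note) { this.note = note; }

  public static void main(String[] args) {
    Encoder<Event> encoder = Encoders.bean(Event.class);
    // Prints the inferred schema; today both fields are reported as nullable.
    System.out.println(encoder.schema().treeString());
  }
}
{code}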

> Spark Encoders.bean() should allow marking non-null fields in its Spark schema
> --
>
> Key: SPARK-31071
> URL: https://issues.apache.org/jira/browse/SPARK-31071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kyrill Alyoshin
>Priority: Major
>
> The Spark _Encoders.bean()_ method should allow the generated StructType schema 
> fields to be *non-nullable*.
> Currently, any non-primitive type is automatically _nullable_. This is 
> hard-coded in the _org.apache.spark.sql.catalyst.JavaTypeInference_ class.  
> This can lead to rather interesting situations... For example, let's say I 
> want to save a dataframe using the Avro format with my own non-Spark-generated 
> Avro schema. Let's also say that my Avro schema has a field that is non-null 
> (i.e., not a union type). Well, it appears *impossible* to store a dataframe 
> using such an Avro schema, since Spark would assume that the field is nullable 
> (as it is in its own schema), which conflicts with the Avro schema semantics 
> and throws an exception.
> I propose making a change to the _JavaTypeInference_ class to observe the 
> JSR-305 _Nonnull_ annotation (and its children) on the provided bean class 
> during StructType schema generation. This would allow bean creators to 
> control the resulting Spark schema so much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30536) Sort-merge join operator spilling performance improvements

2020-03-08 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-30536:

Attachment: spark-30536-explained.pdf

> Sort-merge join operator spilling performance improvements
> --
>
> Key: SPARK-30536
> URL: https://issues.apache.org/jira/browse/SPARK-30536
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sinisa Knezevic
>Priority: Major
> Attachments: spark-30536-explained.pdf
>
>
> Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQL 
> queries (for example, query 14) are not able to run even with extremely large 
> Spark executor memory. The Spark spilling feature has to be enabled in order to 
> process these queries, and processing becomes extremely slow when spilling is 
> enabled. The spilling feature is controlled by two parameters: 
> "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" and 
> "spark.sql.sortMergeJoinExec.buffer.spill.threshold".
> "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" – when this 
> threshold is reached, the data is moved from the 
> ExternalAppendOnlyUnsafeRowArray object into an UnsafeExternalSorter object.
> "spark.sql.sortMergeJoinExec.buffer.spill.threshold" – when this threshold is 
> reached, the data is spilled from the UnsafeExternalSorter object onto 
> disk.
>  
> During execution of a sort-merge join (Left Semi Join), for each row on the left 
> side of the join the "right matches" are found and stored in an 
> ExternalAppendOnlyUnsafeRowArray object. In the case of query 14 there are 
> millions of rows of "right matches". To run this query, spilling is enabled and 
> data is moved from ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter 
> and then spilled onto disk. When millions of rows are processed on the left side 
> of the join, an iterator on top of the spilled "right matches" rows is created 
> each time. This means the iterator on top of the right matches (spilled on disk) 
> is created millions of times. The current Spark implementation creates the 
> iterator on top of the spilled rows and produces I/O, which results in millions 
> of I/O operations when millions of rows are processed.
>  
> To avoid this performance bottleneck, this JIRA introduces the following solution:
> 1. Implement lazy initialization of UnsafeSorterSpillReader, the iterator on top 
> of the spilled rows:
>     … During SortMergeJoin (Left Semi Join) execution, the iterator on the 
> spilled data is created but no iteration over the data is done.
>    ... Lazy initialization of UnsafeSorterSpillReader enables efficient 
> processing of SortMergeJoin even if data is spilled onto disk. 
> Unnecessary I/O is avoided.
> 2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from 
> 1 MB to 1 KB:
>     … The UnsafeSorterSpillReader constructor takes a lot of time due to the 
> default 1 MB memory read buffer.
>     … The code already has logic to grow the memory read buffer if it 
> cannot fit the data, so decreasing the size to 1 KB is safe and has a positive 
> performance impact.
> 3. Improve memory utilization when spilling is enabled in 
> ExternalAppendOnlyUnsafeRowArray:
>     … In the current implementation, when spilling is enabled, an 
> UnsafeExternalSorter object is created, the data is moved from the 
> ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then 
> the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before the 
> ExternalAppendOnlyUnsafeRowArray object is emptied, both objects are in memory 
> with the same data. That requires double the memory and duplicates the data. 
> This can be avoided.
>     … In the proposed solution, when 
> spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new 
> rows to the ExternalAppendOnlyUnsafeRowArray object stops. An UnsafeExternalSorter 
> object is created and new rows are added to that object, while the 
> ExternalAppendOnlyUnsafeRowArray object retains all rows already added to it. 
> This approach enables better memory utilization and avoids unnecessary movement 
> of data from one object to another.
>  
> Testing this solution with query 14 and spilling to disk enabled showed a 500x 
> performance improvement, and it did not degrade the performance of the other SQL 
> queries in the TPC-DS benchmark.
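
A minimal sketch (not part of this ticket) of how the two thresholds described above are set when building a session; the threshold values below are illustrative only:

{code:java}
import org.apache.spark.sql.SparkSession;

public class SortMergeJoinSpillConfig {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("smj-spill-thresholds")
        .master("local[*]")
        // Rows buffered for the matched side move from ExternalAppendOnlyUnsafeRowArray
        // into UnsafeExternalSorter once this row count is reached.
        .config("spark.sql.sortMergeJoinExec.buffer.in.memory.threshold", "4096")
        // UnsafeExternalSorter spills its buffered rows to disk once this row count is reached.
        .config("spark.sql.sortMergeJoinExec.buffer.spill.threshold", "2097152")
        .getOrCreate();

    spark.sql("SELECT 1").show();
    spark.stop();
  }
}
{code}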



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30536) Sort-merge join operator spilling performance improvements

2020-03-08 Thread shanyu zhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054555#comment-17054555
 ] 

shanyu zhao commented on SPARK-30536:
-

Uploaded two slides to explain the optimization idea of this PR.

> Sort-merge join operator spilling performance improvements
> --
>
> Key: SPARK-30536
> URL: https://issues.apache.org/jira/browse/SPARK-30536
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sinisa Knezevic
>Priority: Major
> Attachments: spark-30536-explained.pdf
>
>
> Testing with the TPC-DS 100 TB benchmark data set showed that some of the SQL 
> queries (for example, query 14) are not able to run even with extremely large 
> Spark executor memory. The Spark spilling feature has to be enabled in order to 
> process these queries, and processing becomes extremely slow when spilling is 
> enabled. The spilling feature is controlled by two parameters: 
> "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" and 
> "spark.sql.sortMergeJoinExec.buffer.spill.threshold".
> "spark.sql.sortMergeJoinExec.buffer.in.memory.threshold" – when this 
> threshold is reached, the data is moved from the 
> ExternalAppendOnlyUnsafeRowArray object into an UnsafeExternalSorter object.
> "spark.sql.sortMergeJoinExec.buffer.spill.threshold" – when this threshold is 
> reached, the data is spilled from the UnsafeExternalSorter object onto 
> disk.
>  
> During execution of a sort-merge join (Left Semi Join), for each row on the left 
> side of the join the "right matches" are found and stored in an 
> ExternalAppendOnlyUnsafeRowArray object. In the case of query 14 there are 
> millions of rows of "right matches". To run this query, spilling is enabled and 
> data is moved from ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter 
> and then spilled onto disk. When millions of rows are processed on the left side 
> of the join, an iterator on top of the spilled "right matches" rows is created 
> each time. This means the iterator on top of the right matches (spilled on disk) 
> is created millions of times. The current Spark implementation creates the 
> iterator on top of the spilled rows and produces I/O, which results in millions 
> of I/O operations when millions of rows are processed.
>  
> To avoid this performance bottleneck, this JIRA introduces the following solution:
> 1. Implement lazy initialization of UnsafeSorterSpillReader, the iterator on top 
> of the spilled rows:
>     … During SortMergeJoin (Left Semi Join) execution, the iterator on the 
> spilled data is created but no iteration over the data is done.
>    ... Lazy initialization of UnsafeSorterSpillReader enables efficient 
> processing of SortMergeJoin even if data is spilled onto disk. 
> Unnecessary I/O is avoided.
> 2. Decrease the initial memory read buffer size in UnsafeSorterSpillReader from 
> 1 MB to 1 KB:
>     … The UnsafeSorterSpillReader constructor takes a lot of time due to the 
> default 1 MB memory read buffer.
>     … The code already has logic to grow the memory read buffer if it 
> cannot fit the data, so decreasing the size to 1 KB is safe and has a positive 
> performance impact.
> 3. Improve memory utilization when spilling is enabled in 
> ExternalAppendOnlyUnsafeRowArray:
>     … In the current implementation, when spilling is enabled, an 
> UnsafeExternalSorter object is created, the data is moved from the 
> ExternalAppendOnlyUnsafeRowArray object into the UnsafeExternalSorter, and then 
> the ExternalAppendOnlyUnsafeRowArray object is emptied. Just before the 
> ExternalAppendOnlyUnsafeRowArray object is emptied, both objects are in memory 
> with the same data. That requires double the memory and duplicates the data. 
> This can be avoided.
>     … In the proposed solution, when 
> spark.sql.sortMergeJoinExec.buffer.in.memory.threshold is reached, adding new 
> rows to the ExternalAppendOnlyUnsafeRowArray object stops. An UnsafeExternalSorter 
> object is created and new rows are added to that object, while the 
> ExternalAppendOnlyUnsafeRowArray object retains all rows already added to it. 
> This approach enables better memory utilization and avoids unnecessary movement 
> of data from one object to another.
>  
> Testing this solution with query 14 and spilling to disk enabled showed a 500x 
> performance improvement, and it did not degrade the performance of the other SQL 
> queries in the TPC-DS benchmark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054596#comment-17054596
 ] 

Hyukjin Kwon commented on SPARK-31065:
--

There seem to be two issues. The first is in the JSON schema inference 
(https://github.com/apache/spark/blob/c1986204e59f1e8cc4b611d5a578cb248cb74c28/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala#L115-L122):
it treats empty strings as nulls. We might need an option to treat them as 
StringType.

The other issue is that the {{null}} type is not properly supported in the SQL 
parser (which is used to parse the DDL type string produced by {{schema_of_json}}). 
Hive supports the {{null}} type via the {{void}} keyword, so we might need to 
support a proper keyword that parses as the {{null}} type.
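
A small sketch of the second point (a hypothetical repro, assuming the behavior of the affected versions): the DDL string that {{schema_of_json}} produces here contains a null-typed column, and the DDL parser has no keyword for it:

{code:java}
import org.apache.spark.sql.types.DataType;

public class NullTypeDdl {
  public static void main(String[] args) {
    // Parses fine: an ordinary struct with a string column.
    DataType ok = DataType.fromDDL("struct<a:string>");
    System.out.println(ok.catalogString());

    // Throws org.apache.spark.sql.catalyst.parser.ParseException in the affected versions,
    // because "null" is not a recognized type keyword in the DDL parser.
    DataType bad = DataType.fromDDL("struct<a:null>");
    System.out.println(bad.catalogString());
  }
}
{code}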

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +---------+
> |     json|
> +---------+
> |{"a": ""}|
> +---------+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
>

[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054598#comment-17054598
 ] 

Hyukjin Kwon commented on SPARK-31043:
--

[~nchammas], can you share the steps to reproduce this?

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 3.0.0
>
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31080) Bugs/missing functions in documents

2020-03-08 Thread Viet (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054605#comment-17054605
 ] 

Viet commented on SPARK-31080:
--

Thank you.

> Bugs/missing functions in documents
> ---
>
> Key: SPARK-31080
> URL: https://issues.apache.org/jira/browse/SPARK-31080
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Viet
>Priority: Minor
>
> In the current documentation for the SQL API, I noticed that there is no section 
> for the `PIVOT` keyword, which was introduced in 2.4.0.
> Is there a bug in `mkdocs`? 
> Docs: [https://spark.apache.org/docs/latest/api/sql/]
> P/S: Not sure if this issue should be here, but I could not find any other 
> place to put it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31082) MapOutputTrackerMaster.getMapLocation can't handle last mapIndex

2020-03-08 Thread wuyi (Jira)
wuyi created SPARK-31082:


 Summary: MapOutputTrackerMaster.getMapLocation can't handle last 
mapIndex
 Key: SPARK-31082
 URL: https://issues.apache.org/jira/browse/SPARK-31082
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: wuyi


Currently, for the last mapIndex, MapOutputTrackerMaster.getMapLocation always 
returns empty locations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31073) Add "shuffle write time" to task metrics summary in StagePage.

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31073.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27837
[https://github.com/apache/spark/pull/27837]

> Add "shuffle write time" to task metrics summary in StagePage.
> --
>
> Key: SPARK-31073
> URL: https://issues.apache.org/jira/browse/SPARK-31073
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> In StagePage, "shuffle write time" is not shown in task metrics summary even 
> though "shuffle read blocked time" is shown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31083) .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens

2020-03-08 Thread jiama (Jira)
jiama created SPARK-31083:
-

 Summary: .ClassNotFoundException 
CoarseGrainedClusterMessages$RetrieveDelegationTokens
 Key: SPARK-31083
 URL: https://issues.apache.org/jira/browse/SPARK-31083
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment: spark2.4-cdh6.2
Reporter: jiama


Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: 
org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveDelegationTokens$



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31084) spark on k8s Exception "Database xxx not found" when hive MetaStoreClient lost connection

2020-03-08 Thread liuxiuyuan (Jira)
liuxiuyuan created SPARK-31084:
--

 Summary: spark on k8s Exception "Database xxx not found" when hive 
MetaStoreClient lost connection
 Key: SPARK-31084
 URL: https://issues.apache.org/jira/browse/SPARK-31084
 Project: Spark
  Issue Type: Question
  Components: Kubernetes
Affects Versions: 2.4.4
 Environment: spark 2.4.4

 

 

 
Reporter: liuxiuyuan


 
06-03-2020 12:55:59 CST stage1 INFO - 2020-03-06 04:39:00 [INFO] -- 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.:108 | File 
Output Committer Algorithm version is 106-03-2020 12:55:59 CST stage1 INFO - 
06-03-2020 12:55:59 CST stage1 INFO - 2020-03-06 04:56:05 [WARN] -- 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke:184 | 
MetaStoreClient lost connection. Attempting to reconnect.06-03-2020 12:55:59 
CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - 
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Connection reset06-03-2020 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST 
stage1 INFO - at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)06-03-2020 
12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - at 
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)06-03-2020 
12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:654)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -   at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:641)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:1158)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -   at 
sun.reflect.GeneratedMethodAccessor100.invoke(Unknown Source)06-03-2020 
12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - at 
java.lang.reflect.Method.invoke(Method.java:498)06-03-2020 12:55:59 CST stage1 
INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:156)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - at 
com.sun.proxy.$Proxy38.getDatabase(Unknown Source)06-03-2020 12:55:59 CST 
stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -at 
org.apache.hadoop.hive.ql.metadata.Hive.getDatabase(Hive.java:1301)06-03-2020 
12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -   at 
org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1290)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply$mcZ$sp(HiveClientImpl.scala:349)06-03-2020
 12:56:09 CST stage1 INFO - 06-03-2020 12:56:09 CST stage1 INFO -  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:349)06-03-2020
 12:56:09 CST stage1 INFO - 06-03-2020 12:56:09 CST stage1 INFO - at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$databaseExists$1.apply(HiveClientImpl.scala:349)06-03-2020
 12:56:09 CST stage1 INFO - 06-03-2020 12:56:09 CST stage1 INFO - at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)06-03-2020
 12:56:09 CST stage1 INFO - 06-03-2020 12:56:09 CST stage1 INFO -  at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)06-03-2020
 12:56:09 CST stage1 INFO - 06-03-2020 12:56:09 CST stage1 INFO -   at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)06-03-2020
 12:56:09 CST stage1 INFO - 06-03-2020 12:56:09 CST stage1 INFO - at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)06-03-2020
 12:56:09 CST stage1 INFO -

[jira] [Updated] (SPARK-31084) spark on k8s Exception "Database xxx not found" when hive MetaStoreClient lost connection

2020-03-08 Thread liuxiuyuan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxiuyuan updated SPARK-31084:
---
Description: 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke:184 | 
MetaStoreClient lost connection. Attempting to 
reconnect.org.apache.thrift.transport.TTransportException: 
java.net.SocketException: Connection reset
06-03-2020 12:56:09 CST stage1 INFO - Caused by: java.net.SocketException: 
Connection reset
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
java.net.SocketInputStream.read(SocketInputStream.java:210)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
java.net.SocketInputStream.read(SocketInputStream.java:141)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
java.io.BufferedInputStream.read(BufferedInputStream.java:345)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   ... 70 more

06-03-2020 12:56:09 CST stage1 INFO - 2020-03-06 04:56:06 [INFO] -- 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open:376 | Trying to 
connect to metastore with URI thrift://hive-metastore-server:9083
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO - 2020-03-06 04:56:06 [INFO] -- 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open:472 | Connected to 
metastore.
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO - 2020-03-06 04:56:14 [WARN] -- 
org.apache.spark.internal.Logging$class.logWarning:87 | Kubernetes client has 
been closed (this is expected if the application is shutting down.)
06-03-2020 12:56:09 CST stage1 INFO - 2020-03-06 04:56:14 [WARN] -- 
org.apache.spark.internal.Logging$class.logWarning:87 | Kubernetes client has 
been closed (this is expected if the application is shutting down.)
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO - Exception in thread "main" 
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'xxx' 
not found;
06-03-2020 12:56:09 CST stage1 INFO - 
06-03-2020 12:56:09 CST stage1 INFO -   at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:178)

  was:
 
06-03-2020 12:55:59 CST stage1 INFO - 2020-03-06 04:39:00 [INFO] -- 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.:108 | File 
Output Committer Algorithm version is 106-03-2020 12:55:59 CST stage1 INFO - 
06-03-2020 12:55:59 CST stage1 INFO - 2020-03-06 04:56:05 [WARN] -- 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke:184 | 
MetaStoreClient lost connection. Attempting to reconnect.06-03-2020 12:55:59 
CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - 
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Connection reset06-03-2020 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST 
stage1 INFO - at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)06-03-2020 
12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO - at 
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)06-03-2020 
12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -  at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_database(ThriftHiveMetastore.java:654)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -   at 
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_database(ThriftHiveMetastore.java:641)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:59 CST stage1 INFO -at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getDatabase(HiveMetaStoreClient.java:1158)06-03-2020
 12:55:59 CST stage1 INFO - 06-03-2020 12:55:5

[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER

2020-03-08 Thread Sunayan Saikia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054482#comment-17054482
 ] 

Sunayan Saikia edited comment on SPARK-20427 at 3/9/20, 3:58 AM:
-

[~yumwang] 
It seems this fix broke the way we could get the column name via the _'name'_ 
key of the MetadataBuilder map inside getCatalystType().
Is there a way to get the column name now while overriding the 
getCatalystType() method?

Please check the Java code below for what broke:
public Option<DataType> getCatalystType(int sqlJdbcType, String typeName, int 
size, MetadataBuilder md) {
  String columnName = String.valueOf(md.getMap().get("name").get());


was (Author: sunayansaikia):
Seems this fix broke the way we could get the column name with the _'name'_  
key via the MetadataBuiler map inside getCatalystType()
Is there a way I could get the column name now while I'm overriding the 
getCatalystType() method?

Please check the Java code below for which things broke.
public Option<DataType> getCatalystType(int sqlJdbcType, String typeName, int 
size, MetadataBuilder md) {
  String columnName = String.valueOf(md.getMap().get("name").get());

> Issue with Spark interpreting Oracle datatype NUMBER
> 
>
> Key: SPARK-20427
> URL: https://issues.apache.org/jira/browse/SPARK-20427
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Alexander Andrushenko
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> Oracle has a data type NUMBER. When defining a field of type NUMBER in a table, 
> the field has two components, precision and scale.
> For example, NUMBER(p,s) has precision p and scale s. 
> Precision can range from 1 to 38.
> Scale can range from -84 to 127.
> When reading such a field, Spark can create numbers with precision exceeding 
> 38. In our case it created fields with precision 44,
> calculated as the sum of the precision (in our case 34 digits) and the scale (10):
> "...java.lang.IllegalArgumentException: requirement failed: Decimal precision 
> 44 exceeds max precision 38...".
> The result was that a data frame read from a table in one schema 
> could not be inserted into the identical table in another schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31085) Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31085:
--
Issue Type: Bug  (was: Improvement)

> Amend Spark's Semantic Versioning Policy
> 
>
> Key: SPARK-31085
> URL: https://issues.apache.org/jira/browse/SPARK-31085
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31085) Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31085:
-

 Summary: Amend Spark's Semantic Versioning Policy
 Key: SPARK-31085
 URL: https://issues.apache.org/jira/browse/SPARK-31085
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31085) Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31085:
--
Reporter: Michael Armbrust  (was: Dongjoon Hyun)

> Amend Spark's Semantic Versioning Policy
> 
>
> Key: SPARK-31085
> URL: https://issues.apache.org/jira/browse/SPARK-31085
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Michael Armbrust
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31085) Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31085:
-

Assignee: Michael Armbrust

> Amend Spark's Semantic Versioning Policy
> 
>
> Key: SPARK-31085
> URL: https://issues.apache.org/jira/browse/SPARK-31085
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31085) Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31085:
--
Description: 
This issue tracks all the activity for the following discussion and vote.
- 
https://lists.apache.org/thread.html/r82f99ad8c2798629eed66d65f2cddc1ed196dddf82e8e9370f3b7d32%40%3Cdev.spark.apache.org%3E

- 
https://lists.apache.org/thread.html/r683dbb0481adb1944461b6e1a60aafc44a66423c6e9fa2bab24a07db%40%3Cdev.spark.apache.org%3E

> Amend Spark's Semantic Versioning Policy
> 
>
> Key: SPARK-31085
> URL: https://issues.apache.org/jira/browse/SPARK-31085
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> This issue tracks all the activity for the following discussion and vote.
> - 
> https://lists.apache.org/thread.html/r82f99ad8c2798629eed66d65f2cddc1ed196dddf82e8e9370f3b7d32%40%3Cdev.spark.apache.org%3E
> - 
> https://lists.apache.org/thread.html/r683dbb0481adb1944461b6e1a60aafc44a66423c6e9fa2bab24a07db%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31085) Amend Spark's Semantic Versioning Policy

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31085:
--
Issue Type: Umbrella  (was: Bug)

> Amend Spark's Semantic Versioning Policy
> 
>
> Key: SPARK-31085
> URL: https://issues.apache.org/jira/browse/SPARK-31085
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> This issue tracks all the activity for the following discussion and vote.
> - 
> https://lists.apache.org/thread.html/r82f99ad8c2798629eed66d65f2cddc1ed196dddf82e8e9370f3b7d32%40%3Cdev.spark.apache.org%3E
> - 
> https://lists.apache.org/thread.html/r683dbb0481adb1944461b6e1a60aafc44a66423c6e9fa2bab24a07db%40%3Cdev.spark.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31086) Add Back the Deprecated SQLContext methods

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31086:
-

 Summary: Add Back the Deprecated SQLContext methods
 Key: SPARK-31086
 URL: https://issues.apache.org/jira/browse/SPARK-31086
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31086) Add Back the Deprecated SQLContext methods

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31086:
--
Reporter: Xiao Li  (was: Dongjoon Hyun)

> Add Back the Deprecated SQLContext methods
> --
>
> Key: SPARK-31086
> URL: https://issues.apache.org/jira/browse/SPARK-31086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31086) Add Back the Deprecated SQLContext methods

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31086:
-

Assignee: Xiao Li

> Add Back the Deprecated SQLContext methods
> --
>
> Key: SPARK-31086
> URL: https://issues.apache.org/jira/browse/SPARK-31086
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31087) Add Back Multiple Removed APIs

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31087:
--
Reporter: Xiao Li  (was: Dongjoon Hyun)

> Add Back Multiple Removed APIs
> --
>
> Key: SPARK-31087
> URL: https://issues.apache.org/jira/browse/SPARK-31087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31087) Add Back Multiple Removed APIs

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31087:
-

 Summary: Add Back Multiple Removed APIs
 Key: SPARK-31087
 URL: https://issues.apache.org/jira/browse/SPARK-31087
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31087) Add Back Multiple Removed APIs

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31087:
-

Assignee: Xiao Li

> Add Back Multiple Removed APIs
> --
>
> Key: SPARK-31087
> URL: https://issues.apache.org/jira/browse/SPARK-31087
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31088) Add back HiveContext and createExternalTable

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31088:
--
Reporter: Xiao Li  (was: Dongjoon Hyun)

> Add back HiveContext and createExternalTable
> 
>
> Key: SPARK-31088
> URL: https://issues.apache.org/jira/browse/SPARK-31088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31088) Add back HiveContext and createExternalTable

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31088:
-

 Summary: Add back HiveContext and createExternalTable
 Key: SPARK-31088
 URL: https://issues.apache.org/jira/browse/SPARK-31088
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31088) Add back HiveContext and createExternalTable

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31088:
-

Assignee: Xiao Li

> Add back HiveContext and createExternalTable
> 
>
> Key: SPARK-31088
> URL: https://issues.apache.org/jira/browse/SPARK-31088
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31089) Add back ImageSchema.readImages in Spark 3.0

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31089:
--
Reporter: Weichen Xu  (was: Dongjoon Hyun)

> Add back ImageSchema.readImages in Spark 3.0
> 
>
> Key: SPARK-31089
> URL: https://issues.apache.org/jira/browse/SPARK-31089
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31089) Add back ImageSchema.readImages in Spark 3.0

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31089:
-

 Summary: Add back ImageSchema.readImages in Spark 3.0
 Key: SPARK-31089
 URL: https://issues.apache.org/jira/browse/SPARK-31089
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31089) Add back ImageSchema.readImages in Spark 3.0

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31089.
-

> Add back ImageSchema.readImages in Spark 3.0
> 
>
> Key: SPARK-31089
> URL: https://issues.apache.org/jira/browse/SPARK-31089
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31089) Add back ImageSchema.readImages in Spark 3.0

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31089.
---
Resolution: Won't Do

> Add back ImageSchema.readImages in Spark 3.0
> 
>
> Key: SPARK-31089
> URL: https://issues.apache.org/jira/browse/SPARK-31089
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-31054:
---

> Turn on deprecation in Scala REPL/spark-shell  by default
> -
>
> Key: SPARK-31054
> URL: https://issues.apache.org/jira/browse/SPARK-31054
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Turn on deprecation in Scala REPL/spark-shell by default, so users can always 
> see the details about deprecated APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31054:
--
Parent: SPARK-31085
Issue Type: Sub-task  (was: Improvement)

> Turn on deprecation in Scala REPL/spark-shell  by default
> -
>
> Key: SPARK-31054
> URL: https://issues.apache.org/jira/browse/SPARK-31054
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Turn on deprecation in Scala REPL/spark-shell by default, so users can always 
> see the details about deprecated APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31054.
---
Resolution: Invalid

The approach was invalid because turning on all Java/Scala/library warnings by 
default is too intrusive.
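
For reference, users who want these details can still opt in per session; a 
minimal sketch (assuming the standard Scala REPL {{:settings}} command, which 
spark-shell inherits):

{code:scala}
// Inside spark-shell: enable detailed deprecation warnings for this session
// only, instead of the rejected global default.
:settings -deprecation

// Any subsequent use of a deprecated API now prints the full message, e.g.:
@deprecated("use bar() instead", "1.0") def foo(): Int = 1
foo()   // warning: method foo is deprecated (since 1.0): use bar() instead
{code}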

> Turn on deprecation in Scala REPL/spark-shell  by default
> -
>
> Key: SPARK-31054
> URL: https://issues.apache.org/jira/browse/SPARK-31054
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Turn on deprecation in Scala REPL/spark-shell by default, so users can always 
> see the details about deprecated APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31054.
-

> Turn on deprecation in Scala REPL/spark-shell  by default
> -
>
> Key: SPARK-31054
> URL: https://issues.apache.org/jira/browse/SPARK-31054
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Turn on deprecation in Scala REPL/spark-shell by default, so users can always 
> see the details about deprecated APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30886) Deprecate two-parameter TRIM/LTRIM/RTRIM functions

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Parent: SPARK-31085
Issue Type: Sub-task  (was: Bug)

> Deprecate two-parameter TRIM/LTRIM/RTRIM functions
> --
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show that warning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31090) Revert SPARK-25457 IntegralDivide returns data type of the operands

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31090:
-

 Summary: Revert SPARK-25457 IntegralDivide returns data type of 
the operands
 Key: SPARK-31090
 URL: https://issues.apache.org/jira/browse/SPARK-31090
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31091) Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"

2020-03-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31091:
-

 Summary: Revert SPARK-24640 "Return `NULL` from `size(NULL)` by 
default"
 Key: SPARK-31091
 URL: https://issues.apache.org/jira/browse/SPARK-31091
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31090) Revert SPARK-25457 IntegralDivide returns data type of the operands

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31090:
--
Reporter: Wenchen Fan  (was: Dongjoon Hyun)

> Revert SPARK-25457 IntegralDivide returns data type of the operands
> ---
>
> Key: SPARK-31090
> URL: https://issues.apache.org/jira/browse/SPARK-31090
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31090) Revert SPARK-25457 IntegralDivide returns data type of the operands

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31090:
-

Assignee: Wenchen Fan

> Revert SPARK-25457 IntegralDivide returns data type of the operands
> ---
>
> Key: SPARK-31090
> URL: https://issues.apache.org/jira/browse/SPARK-31090
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31091) Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31091:
--
Reporter: Wenchen Fan  (was: Dongjoon Hyun)

> Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"
> ---
>
> Key: SPARK-31091
> URL: https://issues.apache.org/jira/browse/SPARK-31091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31091) Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31091:
-

Assignee: Wenchen Fan

> Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"
> ---
>
> Key: SPARK-31091
> URL: https://issues.apache.org/jira/browse/SPARK-31091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30960) add back the legacy date/timestamp format support in CSV/JSON parser

2020-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30960:
--
Parent: SPARK-31085
Issue Type: Sub-task  (was: Bug)

> add back the legacy date/timestamp format support in CSV/JSON parser
> 
>
> Key: SPARK-30960
> URL: https://issues.apache.org/jira/browse/SPARK-30960
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054674#comment-17054674
 ] 

Nicholas Chammas commented on SPARK-31065:
--

Thanks for looking into it.

I have a silly question: Why isn't {{schema_of_json()}} simply syntactic sugar 
for {{spark.read.json()}}?

For example:
{code:python}
def _schema_of_json(string):
    df = spark.read.json(spark.sparkContext.parallelize([string]))
    return df.schema{code}
Perhaps there are practical reasons not to do this, but conceptually speaking 
this kind of equivalence should hold. Yet this bug report demonstrates that 
they are not equivalent.
{code:python}
from pyspark.sql.functions import from_json, schema_of_json
json = '{"a": ""}'

df = spark.createDataFrame([(json,)], schema=['json'])
df.show()

# chokes with org.apache.spark.sql.catalyst.parser.ParseException
json_schema = schema_of_json(json)
df.select(from_json('json', json_schema))

# works fine
json_schema = _schema_of_json(json)
df.select(from_json('json', json_schema)) {code}

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +-+
> | json|
> +-+
> |{"a": ""}|
> +-+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'M

[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-08 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054676#comment-17054676
 ] 

Nicholas Chammas commented on SPARK-31043:
--

It's working for me now (per my comment), but when I was seeing the issue, 
simply starting a PySpark shell was enough to trigger that error. Do you still 
need a reproduction?

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Fix For: 3.0.0
>
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31065) Empty string values cause schema_of_json() to return a schema not usable by from_json()

2020-03-08 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054683#comment-17054683
 ] 

Hyukjin Kwon commented on SPARK-31065:
--

Oh, nice catch. Yes, the {{null}} type inferred for each field is replaced with 
the {{string}} type in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JsonInferSchema.scala#L380
but that code path isn't called by {{schema_of_json}}.

Let me make a quick fix.
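
(For reference, a quick probe from spark-shell, Scala API, to see the schema 
string {{schema_of_json}} currently infers for the failing sample; this is only 
an observation aid, not the fix:)

{code:scala}
// Prints whatever schema schema_of_json infers for the sample document, so the
// null-typed field mentioned above can be seen directly.
import org.apache.spark.sql.functions.{lit, schema_of_json}

spark.range(1).select(schema_of_json(lit("""{"a": ""}"""))).show(false)
{code}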

> Empty string values cause schema_of_json() to return a schema not usable by 
> from_json()
> ---
>
> Key: SPARK-31065
> URL: https://issues.apache.org/jira/browse/SPARK-31065
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a reproduction:
>   
> {code:python}
> from pyspark.sql.functions import from_json, schema_of_json
> json = '{"a": ""}'
> df = spark.createDataFrame([(json,)], schema=['json'])
> df.show()
> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> json_schema = schema_of_json(json)
> df.select(from_json('json', json_schema))
> # works fine
> json_schema = spark.read.json(df.rdd.map(lambda x: x[0])).schema
> df.select(from_json('json', json_schema))
> {code}
> The output:
> {code:java}
> >>> from pyspark.sql.functions import from_json, schema_of_json
> >>> json = '{"a": ""}'
> >>> 
> >>> df = spark.createDataFrame([(json,)], schema=['json'])
> >>> df.show()
> +-+
> | json|
> +-+
> |{"a": ""}|
> +-+
> >>> 
> >>> # chokes with org.apache.spark.sql.catalyst.parser.ParseException
> >>> json_schema = schema_of_json(json)
> >>> df.select(from_json('json', json_schema))
> Traceback (most recent call last):
>   File ".../site-packages/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File 
> ".../site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", 
> line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.functions.from_json.
> : org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input '<' expecting {'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'ANY', 
> 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 
> 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 
> 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 
> 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 
> 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 
> 'PIVOT', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 
> 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'AFTER', 'LAST', 
> 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'DIRECTORY', 'VIEW', 'REPLACE', 
> 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 
> 'CODEGEN', 'COST', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 
> 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 
> 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 
> 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 
> 'ROLLBACK', 'MACRO', 'IGNORE', 'BOTH', 'LEADING', 'TRAILING', 'IF', 
> 'POSITION', 'EXTRACT', 'DIV', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 
> 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'SERDE', 
> 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 
> 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 
> 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 
> 'FORMATTED', 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 
> 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 
> 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 
> 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 
> 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 
> 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 
> 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 
> 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 
> 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 
> 'LOCAL', 'INPATH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 6)
> == SQL ==
> struct
> --^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.sc

[jira] [Commented] (SPARK-30983) Support more than 5 typed column in typed Dataset.select API

2020-03-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054692#comment-17054692
 ] 

L. C. Hsieh commented on SPARK-30983:
-

For option 2 (adding more overloaded typed select APIs), one issue is that it 
can be a breaking change for existing user code that calls the untyped select API.
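
A minimal sketch of that concern (names and data are made up for illustration, 
and {{spark}} is the session provided by spark-shell):

{code:scala}
// Today a call with six TypedColumns binds to the varargs select(Column*) and
// returns a DataFrame. If a typed 6-column overload were added, the same call
// would bind to it and return Dataset[(Int, Int, Int, Int, Int, Int)] instead,
// so the explicit DataFrame annotation below would no longer compile.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._

val ds = Seq((1, 2, 3, 4, 5, 6)).toDS()

val df: DataFrame = ds.select(
  col("_1").as[Int], col("_2").as[Int], col("_3").as[Int],
  col("_4").as[Int], col("_5").as[Int], col("_6").as[Int])
{code}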

> Support more than 5 typed column in typed Dataset.select API
> 
>
> Key: SPARK-30983
> URL: https://issues.apache.org/jira/browse/SPARK-30983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Because Dataset only provides overloaded typed select APIs for at most 5 typed 
> columns, once more than 5 typed columns are given, the select call falls back 
> to the untyped API.
> Currently users cannot call typed select with more than 5 typed columns. 
> There are a few options:
> 1. Expose Dataset.selectUntyped (could rename it) to accept any number (due 
> to the limit of ExpressionEncoder.tuple, at most 22 actually) of typed 
> columns. Pros: no need to add much code to Dataset. Cons: the returned 
> type is generally Dataset[_], not a specific one like Dataset[(U1, U2)] as 
> with the overloaded methods.
> 2. Add more overloaded typed select APIs, up to 22 typed column inputs. Pros: 
> a clear return type. Cons: a lot of code added to Dataset for just 
> corner cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054697#comment-17054697
 ] 

L. C. Hsieh commented on SPARK-31074:
-

Based on the description, is this the same issue as the one you reported in SPARK-31071?

> Avro serializer should not fail when a nullable Spark field is written to a 
> non-null Avro column
> 
>
> Key: SPARK-31074
> URL: https://issues.apache.org/jira/browse/SPARK-31074
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Kyrill Alyoshin
>Priority: Major
>
> Spark StructType schemas are strongly biased towards having _nullable_ fields. 
> In fact, this is what _Encoders.bean()_ does - any non-primitive field is 
> automatically _nullable_. When we attempt to serialize dataframes into 
> *user-supplied* Avro schemas where the corresponding fields are marked as 
> _non-null_ (i.e., they are not of _union_ type), any such attempt fails 
> with the following exception:
>  
> {code:java}
> Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string"
>   at org.apache.avro.Schema.getTypes(Schema.java:299)
>   at 
> org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
>   at 
> org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
>  {code}
> This seems rather draconian. We should certainly be able to write a field of 
> the same type and name, as long as its value is not null, into a non-nullable 
> Avro column. In fact, the problem is so *severe* that it is not clear what 
> should be done when the Avro schema is given to you as part of an API 
> communication contract (i.e., it cannot be changed).
> This is an important issue. (A minimal sketch of the failing setup follows 
> below.)
>  
>  
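
A minimal sketch of the setup described above (spark-shell with the spark-avro 
package on the classpath; the record name, field name, and output path are 
illustrative only, not taken from the report):

{code:scala}
// Hypothetical reproduction sketch: the Spark column "name" is nullable, but
// the user-supplied Avro schema declares it as plain "string" (not a union
// with "null"), which the report above says makes AvroSerializer fail.
import spark.implicits._

val avroSchema =
  """{"type": "record", "name": "Person",
    |  "fields": [{"name": "name", "type": "string"}]}""".stripMargin

val df = Seq(Tuple1("Alice")).toDF("name")   // nullable string column

(df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .mode("overwrite")
  .save("/tmp/person_avro"))   // reported to fail with: Not a union: "string"
{code}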



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31083) .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens

2020-03-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054703#comment-17054703
 ] 

L. C. Hsieh commented on SPARK-31083:
-

Can you tell us more about this issue? I think RetrieveDelegationTokens is only 
available in 3.0. Are you using a Spark version that conflicts with the Spark 
version running on the cluster?

> .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens
> -
>
> Key: SPARK-31083
> URL: https://issues.apache.org/jira/browse/SPARK-31083
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: spark2.4-cdh6.2
>Reporter: jiama
>Priority: Major
>
> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveDelegationTokens$



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org