[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0

2020-01-07 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010438#comment-17010438
 ] 

Hyukjin Kwon commented on SPARK-30196:
--

(y)

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning

2020-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30425.
--
Resolution: Duplicate

> FileScan of Data Source V2 doesn't implement Partition Pruning
> --
>
> Key: SPARK-30425
> URL: https://issues.apache.org/jira/browse/SPARK-30425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Haifeng Chen
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I was trying to understand how Data Source V2 handles partition pruning, but I 
> couldn't find any code in the current Data Source V2 implementation that 
> filters out the unnecessary files. For a file data source, the base class 
> FileScan of Data Source V2 should presumably handle this in its "partitions" 
> method, but the current implementation looks like the following:
> protected def partitions: Seq[FilePartition] = {
>  val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
>  
> listFiles is passed two empty sequences, so no files are filtered by the 
> partition filter.
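
A quick way to check whether partition pruning actually happens is to inspect the 
physical plan. The snippet below is only a minimal sketch for verification, assuming 
a SparkSession named spark and a dataset partitioned by a column p (the path is a 
placeholder):

{code:scala}
// Read a partition-discovered dataset and filter on the partition column, then
// inspect the scan node of the physical plan for the partition predicate.
import spark.implicits._

val df = spark.read.parquet("/tmp/partitioned_table")   // placeholder path
df.where($"p" === 1).explain(true)
// With pruning in place, the partition predicate appears on the scan node and only
// the matching partition directories are listed; without it, every file is listed.
{code}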



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30455) Select All should unselect after un-selecting any selected item from list.

2020-01-07 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010424#comment-17010424
 ] 

Ankit Raj Boudh commented on SPARK-30455:
-

I will raise a PR for this.

> Select All should unselect after un-selecting any selected item from list.
> --
>
> Key: SPARK-30455
> URL: https://issues.apache.org/jira/browse/SPARK-30455
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30455) Select All should unselect after un-selecting any selected item from list.

2020-01-07 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010424#comment-17010424
 ] 

Ankit Raj Boudh edited comment on SPARK-30455 at 1/8/20 7:06 AM:
-

I will raise a PR for this.


was (Author: ankitraj):
i will raise for this.

> Select All should unselect after un-selecting any selected item from list.
> --
>
> Key: SPARK-30455
> URL: https://issues.apache.org/jira/browse/SPARK-30455
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: Ankit Raj Boudh
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30455) Select All should unselect after un-selecting any selected item from list.

2020-01-07 Thread Ankit Raj Boudh (Jira)
Ankit Raj Boudh created SPARK-30455:
---

 Summary: Select All should unselect after un-selecting any 
selected item from list.
 Key: SPARK-30455
 URL: https://issues.apache.org/jira/browse/SPARK-30455
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.4
Reporter: Ankit Raj Boudh






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning

2020-01-07 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010414#comment-17010414
 ] 

Gengliang Wang commented on SPARK-30425:


[~sandeep.katta2007] Yes.
[~jerrychenhf] Thanks for reporting the issue! Do you mind if I close this one?

> FileScan of Data Source V2 doesn't implement Partition Pruning
> --
>
> Key: SPARK-30425
> URL: https://issues.apache.org/jira/browse/SPARK-30425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Haifeng Chen
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I was trying to understand how Data Source V2 handles partition pruning, but I 
> couldn't find any code in the current Data Source V2 implementation that 
> filters out the unnecessary files. For a file data source, the base class 
> FileScan of Data Source V2 should presumably handle this in its "partitions" 
> method, but the current implementation looks like the following:
> protected def partitions: Seq[FilePartition] = {
>  val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
>  
> listFiles is passed two empty sequences, so no files are filtered by the 
> partition filter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30454) Null Dereference in HiveSQLException

2020-01-07 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010412#comment-17010412
 ] 

pavithra ramachandran commented on SPARK-30454:
---

I shall raise the PR.

> Null Dereference in HiveSQLException
> 
>
> Key: SPARK-30454
> URL: https://issues.apache.org/jira/browse/SPARK-30454
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: pavithra ramachandran
>Priority: Major
>
> A null pointer dereference was found in Spark's HiveSQLException code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30454) Null Dereference in HiveSQLException

2020-01-07 Thread pavithra ramachandran (Jira)
pavithra ramachandran created SPARK-30454:
-

 Summary: Null Dereference in HiveSQLException
 Key: SPARK-30454
 URL: https://issues.apache.org/jira/browse/SPARK-30454
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.4, 3.0.0
Reporter: pavithra ramachandran


A null pointer dereference was found in Spark's HiveSQLException code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning

2020-01-07 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010408#comment-17010408
 ] 

Sandeep Katta commented on SPARK-30425:
---

[~gengliang]

> FileScan of Data Source V2 doesn't implement Partition Pruning
> --
>
> Key: SPARK-30425
> URL: https://issues.apache.org/jira/browse/SPARK-30425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Haifeng Chen
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I was trying to understand how Data Source V2 handles partition pruning, but I 
> couldn't find any code in the current Data Source V2 implementation that 
> filters out the unnecessary files. For a file data source, the base class 
> FileScan of Data Source V2 should presumably handle this in its "partitions" 
> method, but the current implementation looks like the following:
> protected def partitions: Seq[FilePartition] = {
>  val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
>  
> listFiles is passed two empty sequences, so no files are filtered by the 
> partition filter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30425) FileScan of Data Source V2 doesn't implement Partition Pruning

2020-01-07 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010407#comment-17010407
 ] 

Sandeep Katta commented on SPARK-30425:
---

Is this a duplicate of 
[SPARK-30428|https://issues.apache.org/jira/browse/SPARK-30428]?

> FileScan of Data Source V2 doesn't implement Partition Pruning
> --
>
> Key: SPARK-30425
> URL: https://issues.apache.org/jira/browse/SPARK-30425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Haifeng Chen
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I was trying to understand how Data Source V2 handles partition pruning, but I 
> couldn't find any code in the current Data Source V2 implementation that 
> filters out the unnecessary files. For a file data source, the base class 
> FileScan of Data Source V2 should presumably handle this in its "partitions" 
> method, but the current implementation looks like the following:
> protected def partitions: Seq[FilePartition] = {
>  val selectedPartitions = fileIndex.listFiles(Seq.empty, Seq.empty)
>  
> listFiles is passed two empty sequences, so no files are filtered by the 
> partition filter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28478) Optimizer rule to remove unnecessary explicit null checks for null-intolerant expressions (e.g. if(x is null, x, f(x)))

2020-01-07 Thread David Vrba (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010387#comment-17010387
 ] 

David Vrba commented on SPARK-28478:


[~cloud_fan] What do you think about this? Is it worth implementing? If so, I 
would like to do it; if not, I won't bother.

> Optimizer rule to remove unnecessary explicit null checks for null-intolerant 
> expressions (e.g. if(x is null, x, f(x)))
> ---
>
> Key: SPARK-28478
> URL: https://issues.apache.org/jira/browse/SPARK-28478
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer, SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I ran across a family of expressions like
> {code:java}
> if(x is null, x, substring(x, 0, 1024)){code}
> or 
> {code:java}
> when($"x".isNull, $"x", substring($"x", 0, 1024)){code}
> that were written this way because the query author was unsure about whether 
> {{substring}} would return {{null}} when its input string argument is null.
> This explicit null-handling is unnecessary and adds bloat to the generated 
> code, especially if it's done via a {{CASE}} statement (which compiles down 
> to a {{do-while}} loop).
> In another case I saw a query compiler which automatically generated this 
> type of code.
> It would be cool if Spark could automatically optimize such queries to remove 
> these redundant null checks. Here's a sketch of what such a rule might look 
> like (assuming that SPARK-28477 has been implemented, so we only need to worry 
> about the {{IF}} case):
>  * In the pattern match, check the following three conditions in the 
> following order (to benefit from short-circuiting):
>  ** The {{IF}} condition is an explicit null-check of a column {{c}}
>  ** The {{true}} expression returns either {{c}} or {{null}}
>  ** The {{false}} expression is a _null-intolerant_ expression with {{c}} as 
> a _direct_ child. 
>  * If these conditions match, replace the entire {{If}} with the {{false}} 
> branch's expression.
>  
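
As a concrete, user-level illustration of the rewrite such a rule would perform (this 
is only a sketch of the equivalence, not the optimizer rule itself; it assumes an 
active SparkSession named spark for the $"..." column syntax):

{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{substring, when}

// The hand-written null check...
val guarded = when($"x".isNull, $"x").otherwise(substring($"x", 0, 1024))

// ...is equivalent to the bare expression, because substring is null-intolerant:
// a null input already yields a null output, so the explicit check adds nothing.
val simplified = substring($"x", 0, 1024)
{code}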



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30408) orderBy in sortBy clause is removed by EliminateSorts

2020-01-07 Thread APeng Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

APeng Zhang updated SPARK-30408:

Description: 
OrderBy in sortBy clause will be removed by EliminateSorts.

code to reproduce:
{code:java}
val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("b2", 2, 2), ("c", 3, 6) 
).toDF("a", "b", "c") 
val groupData = dataset.orderBy("b")
val sortData = groupData.sortWithinPartitions("c")
{code}
The content of groupData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5]
[b2,2,2]
partition 2: 
[c,3,6]{code}
The content of sortData is:
{code:java}
partition 0: 
[a,1,4]
[b,2,5]
partition 1: 
[b2,2,2]
[c,3,6]{code}
 

UT to cover this defect:

In EliminateSortsSuite.scala
{code:java}
test("should not remove orderBy in sortBy clause") {
  val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
  val optimized = Optimize.execute(plan.analyze)
  val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
  comparePlans(optimized, correctAnswer)
}{code}
 

 
This test will fail because the sortBy was removed by EliminateSorts.

  was:
OrderBy in sortBy clause will be removed by EliminateSorts.

code to reproduce:
{code:java}
val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("c", 3, 6) ).toDF("a", "b", "c") 
val groupData = dataset.orderBy("b")
val sortData = groupData.sortWithinPartitions("c")
{code}
The content of groupData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5]
partition 2: 
[c,3,6]{code}
The content of sortData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5], 
[c,3,6]{code}
 

UT to cover this defect:

In EliminateSortsSuite.scala
{code:java}
test("should not remove orderBy in sortBy clause") {
  val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
  val optimized = Optimize.execute(plan.analyze)
  val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
  comparePlans(optimized, correctAnswer)
}{code}
 

 
 This test will be failed because sortBy was removed by EliminateSorts.


> orderBy in sortBy clause is removed by EliminateSorts
> -
>
> Key: SPARK-30408
> URL: https://issues.apache.org/jira/browse/SPARK-30408
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: APeng Zhang
>Priority: Major
>
> OrderBy in sortBy clause will be removed by EliminateSorts.
> code to reproduce:
> {code:java}
> val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("b2", 2, 2), ("c", 3, 6) 
> ).toDF("a", "b", "c") 
> val groupData = dataset.orderBy("b")
> val sortData = groupData.sortWithinPartitions("c")
> {code}
> The content of groupData is:
> {code:java}
> partition 0: 
> [a,1,4]
> partition 1: 
> [b,2,5]
> [b2,2,2]
> partition 2: 
> [c,3,6]{code}
> The content of sortData is:
> {code:java}
> partition 0: 
> [a,1,4]
> [b,2,5]
> partition 1: 
> [b2,2,2]
> [c,3,6]{code}
>  
> UT to cover this defect:
> In EliminateSortsSuite.scala
> {code:java}
> test("should not remove orderBy in sortBy clause") {
>   val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
>   val optimized = Optimize.execute(plan.analyze)
>   val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
>   comparePlans(optimized, correctAnswer)
> }{code}
>  
>  
>  This test will fail because the sortBy was removed by EliminateSorts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30427) Add config item for limiting partition number when calculating statistics through File System

2020-01-07 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30427:
--
Description: 
Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of 
table partitions through the file system (e.g. HDFS), it does not consider the 
number of partitions. If the number of partitions is huge, calculating the 
statistics can take a long time, and the result may not be that useful.

It would be reasonable to add a config item to limit the number of partitions 
for which statistics are calculated through the file system.
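
A rough sketch of what such a config entry could look like, using the SQLConf 
builder pattern; the config name and default below are assumptions for illustration 
only, not the actual proposal:

{code:scala}
// Hypothetical entry, as it might appear inside SQLConf.scala.
val MAX_PARTITION_NUM_FOR_FS_STATS =
  buildConf("spark.sql.statistics.fileSystem.maxPartitionNum")
    .doc("Maximum number of partitions for which table statistics are calculated " +
      "through the file system; beyond this limit, skip the file-system-based calculation.")
    .intConf
    .createWithDefault(1000)
{code}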

> Add config item for limiting partition number when calculating statistics 
> through File System
> -
>
> Key: SPARK-30427
> URL: https://issues.apache.org/jira/browse/SPARK-30427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>
> Currently, when Spark needs to calculate the statistics (e.g. sizeInBytes) of 
> table partitions through the file system (e.g. HDFS), it does not consider the 
> number of partitions. If the number of partitions is huge, calculating the 
> statistics can take a long time, and the result may not be that useful.
> It would be reasonable to add a config item to limit the number of partitions 
> for which statistics are calculated through the file system.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30427) Add config item for limiting partition number when calculating statistics through File System

2020-01-07 Thread Hu Fuwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hu Fuwang updated SPARK-30427:
--
Summary: Add config item for limiting partition number when calculating 
statistics through File System  (was: Add config item for limiting partition 
number when calculating statistics through HDFS)

> Add config item for limiting partition number when calculating statistics 
> through File System
> -
>
> Key: SPARK-30427
> URL: https://issues.apache.org/jira/browse/SPARK-30427
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hu Fuwang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0

2020-01-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010373#comment-17010373
 ] 

Takeshi Yamamuro commented on SPARK-30196:
--

Yeah, it seems so (I can reproduce this on an older Mac environment). I'm looking 
into it, so please give me a bit more time ;)

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28137) Data Type Formatting Functions: `to_number`

2020-01-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010351#comment-17010351
 ] 

Takeshi Yamamuro commented on SPARK-28137:
--

See: [https://github.com/apache/spark/pull/25963#issuecomment-571885135]

> Data Type Formatting Functions: `to_number`
> ---
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28137) Data Type Formatting Functions: `to_number`

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28137.
--
Resolution: Won't Fix

> Data Type Formatting Functions: `to_number`
> ---
>
> Key: SPARK-28137
> URL: https://issues.apache.org/jira/browse/SPARK-28137
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||
> |{{to_number(}}{{text}}{{, }}{{text}}{{)}}|{{numeric}}|convert string to 
> numeric|{{to_number('12,454.8-', '99G999D9S')}}|
> https://www.postgresql.org/docs/12/functions-formatting.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29878) Improper cache strategies in GraphX

2020-01-07 Thread Dong Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010317#comment-17010317
 ] 

Dong Wang commented on SPARK-29878:
---

So are these unnecessary caches tolerable?

The cached data is used only once in these cases, i.e., SSSPExample and 
ConnectedComponentsExample, but I know the caches are necessary for most of the 
other cases.

Is there a perfect way to handle all cases well?

> Improper cache strategies in GraphX
> ---
>
> Key: SPARK-29878
> URL: https://issues.apache.org/jira/browse/SPARK-29878
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 3.0.0
>Reporter: Dong Wang
>Priority: Major
>
> I have run examples.graphx.SSSPExample and looked through the RDD dependency 
> graphs as well as the persist operations. There are some improper cache 
> strategies in GraphX. The same situations also exist when I run 
> ConnectedComponentsExample.
> 1. vertices.cache() and newEdges.cache() are unnecessary
> In SSSPExample, a graph is initialized by GraphImpl.mapVertices(). In this 
> method, a GraphImpl object is created using GraphImpl.apply(vertices, edges), 
> and the RDDs vertices/newEdges are cached in apply(). But these two RDDs are not 
> used directly anymore (their child RDDs have been cached) in SSSPExample, so 
> the persists can be unnecessary here. 
> However, the other examples may need these two persists, so I think they 
> cannot simply be removed. It might be hard to fix.
> {code:scala}
>   def apply[VD: ClassTag, ED: ClassTag](
>   vertices: VertexRDD[VD],
>   edges: EdgeRDD[ED]): GraphImpl[VD, ED] = {
> vertices.cache() // It is unnecessary for SSSPExample and 
> ConnectedComponentsExample
> // Convert the vertex partitions in edges to the correct type
> val newEdges = edges.asInstanceOf[EdgeRDDImpl[ED, _]]
>   .mapEdgePartitions((pid, part) => part.withoutVertexAttributes[VD])
>   .cache() // It is unnecessary for SSSPExample and 
> ConnectedComponentsExample
> GraphImpl.fromExistingRDDs(vertices, newEdges)
>   }
> {code}
> 2. Missing persist on newEdges
> SSSPExample invokes Pregel for execution, and Pregel utilizes 
> ReplicatedVertexView.upgrade(). I found that the RDD newEdges is used directly 
> by multiple actions in Pregel, so newEdges should be persisted.
> As with the issue above, this one is also found in 
> ConnectedComponentsExample. It is also hard to fix, because the added persist 
> may be unnecessary for other examples.
> {code:scala}
> // Pregel.scala
> // compute the messages
> var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) // 
> newEdges is created here
> val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
>   checkpointInterval, graph.vertices.sparkContext)
> messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
> var activeMessages = messages.count() // The first action to use newEdges
> ...
> while (activeMessages > 0 && i < maxIterations) {
>   // Receive the messages and update the vertices.
>   prevG = g
>   g = g.joinVertices(messages)(vprog) // Generating g depends on 
> newEdges
>   ...
>   activeMessages = messages.count() // The second action to use newEdges. 
> newEdges should be unpersisted after this instruction.
> {code}
> {code:scala}
> // ReplicatedVertexView.scala
>   def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: 
> Boolean): Unit = {
>   ...
>val newEdges = 
> edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) {
> (ePartIter, shippedVertsIter) => ePartIter.map {
>   case (pid, edgePartition) =>
> (pid, 
> edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator)))
> }
>   })
>   edges = newEdges // newEdges should be persisted
>   hasSrcId = includeSrc
>   hasDstId = includeDst
> }
>   }
> {code}
> As I don't have much knowledge about GraphX, I don't know how to fix these 
> issues well.
> This issue was reported by our tool CacheCheck, which dynamically detects 
> persist()/unpersist() API misuses.
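
The general pattern being asked for is sketched below: persist an RDD that several 
actions will reuse and release it once it is no longer needed. This is an 
illustrative, stand-alone example, not a patch to GraphX itself; sc is an assumed 
SparkContext and expensive stands in for newEdges/messages:

{code:scala}
import org.apache.spark.storage.StorageLevel

val expensive = sc.parallelize(1 to 1000000).map(_ * 2)

val reused = expensive.persist(StorageLevel.MEMORY_AND_DISK)
val firstAction  = reused.count()   // first action: computes and caches the data
val secondAction = reused.count()   // second action: reuses the cached data
reused.unpersist(blocking = false)  // release once nothing depends on it anymore
{code}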



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30125) Remove PostgreSQL dialect

2020-01-07 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010304#comment-17010304
 ] 

Yuanjian Li commented on SPARK-30125:
-

Also link #26940 with this Jira.

> Remove PostgreSQL dialect
> -
>
> Key: SPARK-30125
> URL: https://issues.apache.org/jira/browse/SPARK-30125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> Per the discussion in 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html],
>  we need to remove the PostgreSQL dialect from the code base for several reasons:
> 1. The current approach makes the codebase complicated and hard to maintain.
> 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.
>  
> Currently we have 3 features under the PostgreSQL dialect:
> 1. SPARK-27931: when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. 
> are also allowed as true strings.
> 2. SPARK-29364: `date - date` returns an interval in Spark (SQL standard 
> behavior), but returns an int in PostgreSQL.
> 3. SPARK-28395: `int / int` returns a double in Spark, but returns an int in 
> PostgreSQL (there is no standard).
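
For example, item 3 can be checked directly in a Spark shell (assuming a 
SparkSession named spark); the snippet only illustrates the behavioral difference 
the dialect papered over:

{code:scala}
// Spark's default behavior: integer division is performed as double division.
spark.sql("SELECT 1 / 2").collect()   // returns 0.5 (a double)
// PostgreSQL's integer division would return 0, which is what the dialect feature emulated.
{code}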



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30450) Exclude .git folder for python linter

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30450.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27121
[https://github.com/apache/spark/pull/27121]

> Exclude .git folder for python linter
> -
>
> Key: SPARK-30450
> URL: https://issues.apache.org/jira/browse/SPARK-30450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> The python linter shouldn't include the .git folder. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30429) WideSchemaBenchmark fails with OOM

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30429.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27117
[https://github.com/apache/spark/pull/27117]

> WideSchemaBenchmark fails with OOM
> --
>
> Key: SPARK-30429
> URL: https://issues.apache.org/jira/browse/SPARK-30429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: WideSchemaBenchmark_console.txt
>
>
> Run WideSchemaBenchmark on the master (commit 
> bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via:
> {code}
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
> {code}
> This fails with:
> {code}
> Caused by: java.lang.reflect.InvocationTargetException
> [error]   at 
> sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
> [error]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [error]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
> [error]   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> [error]   ... 132 more
> [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [error]   at java.util.Arrays.copyOfRange(Arrays.java:3664)
> [error]   at java.lang.String.(String.java:207)
> [error]   at java.lang.StringBuilder.toString(StringBuilder.java:407)
> [error]   at 
> org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
> [error]   at 
> org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410)
> [error]   at 
> org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown 
> Source)
> {code}
> Full stack dump is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30429) WideSchemaBenchmark fails with OOM

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30429:
-

Assignee: L. C. Hsieh

> WideSchemaBenchmark fails with OOM
> --
>
> Key: SPARK-30429
> URL: https://issues.apache.org/jira/browse/SPARK-30429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: L. C. Hsieh
>Priority: Major
> Attachments: WideSchemaBenchmark_console.txt
>
>
> Run WideSchemaBenchmark on the master (commit 
> bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via:
> {code}
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
> {code}
> This fails with:
> {code}
> Caused by: java.lang.reflect.InvocationTargetException
> [error]   at 
> sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
> [error]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [error]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
> [error]   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> [error]   ... 132 more
> [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [error]   at java.util.Arrays.copyOfRange(Arrays.java:3664)
> [error]   at java.lang.String.(String.java:207)
> [error]   at java.lang.StringBuilder.toString(StringBuilder.java:407)
> [error]   at 
> org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
> [error]   at 
> org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410)
> [error]   at 
> org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown 
> Source)
> {code}
> Full stack dump is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30453) Update AppVeyor R version to 3.6.2

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30453.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27124
[https://github.com/apache/spark/pull/27124]

> Update AppVeyor R version to 3.6.2
> --
>
> Key: SPARK-30453
> URL: https://issues.apache.org/jira/browse/SPARK-30453
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30453) Update AppVeyor R version to 3.6.2

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30453:
-

Assignee: Hyukjin Kwon

> Update AppVeyor R version to 3.6.2
> --
>
> Key: SPARK-30453
> URL: https://issues.apache.org/jira/browse/SPARK-30453
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SparkR
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30302) Complete info for show create table for views

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30302.
--
Fix Version/s: 3.0.0
 Assignee: Zhenhua Wang
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26944]

> Complete info for show create table for views
> -
>
> Key: SPARK-30302
> URL: https://issues.apache.org/jira/browse/SPARK-30302
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add table/column comments and table properties to the result of show create 
> table of views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010271#comment-17010271
 ] 

Takeshi Yamamuro commented on SPARK-30421:
--

Based on the current implementation, drop and select (drop is a shorthand for a 
partial use-case of select?) seem to have the same semantics. If so, that 
query might be correct under lazy evaluation. 

Btw, to change this behaviour, IMO it would be better to reconstruct the 
dataframe ([https://github.com/maropu/spark/commit/fac04161405b9ee755b4c7f87de2a144c609c7fa])
 instead of modifying the resolution logic, because the resolution logic 
affects many places.  

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not; instead it works without error, as if the column "bar" 
> still existed.
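
A minimal reproduction plus one possible workaround along the lines of the comment 
above, i.e. rebuilding the DataFrame so the dropped column should no longer be 
resolvable. This is a sketch assuming a SparkSession named spark, not a proposed fix:

{code:scala}
import spark.implicits._

val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
df.drop("bar").where($"bar" === "a").show()   // currently succeeds, filtering on the dropped column

// Rebuilding the DataFrame from its RDD and schema cuts the logical-plan lineage,
// so the dropped column should no longer resolve and the expected
// AnalysisException ("cannot resolve '`bar`'") should be thrown.
val dropped = df.drop("bar")
val rebuilt = spark.createDataFrame(dropped.rdd, dropped.schema)
// rebuilt.where($"bar" === "a")   // expected to throw org.apache.spark.sql.AnalysisException
{code}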



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30381) GBT reuse treePoints for all trees

2020-01-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30381.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27103
[https://github.com/apache/spark/pull/27103]

> GBT reuse treePoints for all trees
> --
>
> Key: SPARK-30381
> URL: https://issues.apache.org/jira/browse/SPARK-30381
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> In the existing GBT, each tree first computes the available splits of each 
> feature (via RandomForest.findSplits), based on the dataset sampled at this 
> iteration. It then uses these splits to discretize vectors into 
> BaggedPoint[TreePoint]s. The BaggedPoints (of the same size as the input 
> vectors) are then cached and used at this iteration. Note that the splits for 
> discretization in each tree are different (if subsamplingRate < 1), only 
> because the sampled vectors are different.
> However, the splits at different iterations should be similar if the sampled 
> dataset is big enough, and even identical if subsamplingRate = 1.
>  
> However, in other well-known GBT implementations with binned features (like 
> XGBoost/LightGBM), the splits for discretization are the same across iterations:
> {code:java}
> import xgboost as xgb
> from sklearn.datasets import load_svmlight_file
> X, y = 
> load_svmlight_file('/data0/Dev/Opensource/spark/data/mllib/sample_linear_regression_data.txt')
> dtrain = xgb.DMatrix(X[:, :2], label=y)
> num_round = 3
> param = {'max_depth': 2, 'eta': 1, 'objective': 'reg:squarederror', 
> 'tree_method': 'hist', 'max_bin': 2, 'eta': 0.01, 'subsample':0.5}
> bst = xgb.train(param, dtrain, num_round)
> bst.trees_to_dataframe('/tmp/bst')
> Out[61]: 
> Tree  Node   ID Feature Split  Yes   No MissingGain  Cover
> 0  0 0  0-0  f1  0.000408  0-1  0-2 0-1  170.337143  256.0
> 1  0 1  0-1  f0  0.003531  0-3  0-4 0-3   44.865482  121.0
> 2  0 2  0-2  f0  0.003531  0-5  0-6 0-5  125.615570  135.0
> 3  0 3  0-3Leaf   NaN  NaN  NaN NaN   -0.010050   67.0
> 4  0 4  0-4Leaf   NaN  NaN  NaN NaN0.002126   54.0
> 5  0 5  0-5Leaf   NaN  NaN  NaN NaN0.020972   69.0
> 6  0 6  0-6Leaf   NaN  NaN  NaN NaN0.001714   66.0
> 7  1 0  1-0  f0  0.003531  1-1  1-2 1-1   50.417793  263.0
> 8  1 1  1-1  f1  0.000408  1-3  1-4 1-3   48.732742  124.0
> 9  1 2  1-2  f1  0.000408  1-5  1-6 1-5   52.832161  139.0
> 10 1 3  1-3Leaf   NaN  NaN  NaN NaN   -0.012784   63.0
> 11 1 4  1-4Leaf   NaN  NaN  NaN NaN   -0.000287   61.0
> 12 1 5  1-5Leaf   NaN  NaN  NaN NaN0.008661   64.0
> 13 1 6  1-6Leaf   NaN  NaN  NaN NaN   -0.003624   75.0
> 14 2 0  2-0  f1  0.000408  2-1  2-2 2-1   62.136013  242.0
> 15 2 1  2-1  f0  0.003531  2-3  2-4 2-3  150.537781  118.0
> 16 2 2  2-2  f0  0.003531  2-5  2-6 2-53.829046  124.0
> 17 2 3  2-3Leaf   NaN  NaN  NaN NaN   -0.016737   65.0
> 18 2 4  2-4Leaf   NaN  NaN  NaN NaN0.005809   53.0
> 19 2 5  2-5Leaf   NaN  NaN  NaN NaN0.005251   60.0
> 20 2 6  2-6Leaf   NaN  NaN  NaN NaN0.001709   64.0
>  {code}
>  
> We can see that even though we set subsample=0.5, the three trees share the same 
> splits.
>  
> So I think we could reuse the splits and treePoints at all iterations:
> at iteration 0, compute the splits on the whole training dataset and use the 
> splits to generate the treePoints;
> at each subsequent iteration, directly generate baggedPoints from the treePoints.
> This way we do not need to persist/unpersist the internal training dataset for 
> each tree.
>  
>  
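
A high-level sketch of the proposed flow; the helpers findSplits, discretize, sample 
and trainSingleTree are hypothetical stand-ins for the corresponding internal steps, 
not actual MLlib APIs:

{code:scala}
// Compute the splits and TreePoints once, up front, on the whole training dataset...
val splits     = findSplits(trainingData)                    // hypothetical helper
val treePoints = discretize(trainingData, splits).cache()    // reused by every tree

// ...then each boosting iteration only re-samples BaggedPoints from the cached
// TreePoints instead of recomputing splits and re-discretizing the input.
val trees = (0 until numIterations).map { iter =>
  val bagged = sample(treePoints, subsamplingRate, seed + iter)  // hypothetical helper
  trainSingleTree(bagged, splits)                                // hypothetical helper
}
treePoints.unpersist()
{code}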



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30453) Update AppVeyor R version to 3.6.2

2020-01-07 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30453:


 Summary: Update AppVeyor R version to 3.6.2
 Key: SPARK-30453
 URL: https://issues.apache.org/jira/browse/SPARK-30453
 Project: Spark
  Issue Type: Improvement
  Components: Build, SparkR
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28264) Revisiting Python / pandas UDF

2020-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28264:


Assignee: Hyukjin Kwon  (was: Reynold Xin)

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> Over the past two years, pandas UDFs have perhaps been the most important change 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusion among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
> -See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]-
>  New proposal: 
> https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering

2020-01-07 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010227#comment-17010227
 ] 

Aman Omer commented on SPARK-30421:
---

cc [~maropu]

> Dropped columns still available for filtering
> -
>
> Key: SPARK-30421
> URL: https://issues.apache.org/jira/browse/SPARK-30421
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Tobias Hermann
>Priority: Minor
>
> The following minimal example:
> {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar")
> df.select("foo").where($"bar" === "a").show
> df.drop("bar").where($"bar" === "a").show
> {quote}
> should result in an error like the following:
> {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given 
> input columns: [foo];
> {quote}
> However, it does not but instead works without error, as if the column "bar" 
> would exist.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2020-01-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010223#comment-17010223
 ] 

Takeshi Yamamuro commented on SPARK-26249:
--

I'll close this because the corresponding PR is inactive (it was automatically 
closed). If necessary, please reopen this. Thanks.

> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+  
> Spark has an extension points API that allows third parties to extend Spark with 
> custom optimization rules. The current API does not allow fine-grained control 
> over when an optimization rule is exercised. In the current API, there 
> is no way to add a batch to the optimizer using the SparkSessionExtensions 
> API, similar to the postHocOptimizationBatches in SparkOptimizer.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> +Proposal:+ 
> Add two new APIs to the existing extension points to allow more 
> flexibility for third-party users of Spark: 
>  # Inject an optimizer rule into a batch, in order 
>  # Inject an optimizer batch, in order
> The design spec is here:
> [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]
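
For context, the snippet below shows the existing injection hook that the proposal 
wants to extend; the no-op rule is only an illustration. Today the injected rule 
lands in a fixed extension batch with no control over its position, which is exactly 
the gap described above:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing optimizer rule, used only to show where the extension hook sits.
case class NoOpRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(_.injectOptimizerRule(NoOpRule))  // no way to say *where* it runs
  .getOrCreate()
{code}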



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26249) Extension Points Enhancements to inject a rule in order and to add a batch

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26249.
--
Resolution: Won't Fix

> Extension Points Enhancements to inject a rule in order and to add a batch
> --
>
> Key: SPARK-26249
> URL: https://issues.apache.org/jira/browse/SPARK-26249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Major
>
> +Motivation:+  
> Spark has an extension points API that allows third parties to extend Spark with 
> custom optimization rules. The current API does not allow fine-grained control 
> over when an optimization rule is exercised. In the current API, there 
> is no way to add a batch to the optimizer using the SparkSessionExtensions 
> API, similar to the postHocOptimizationBatches in SparkOptimizer.
> In our use cases, we have optimization rules that we want to add as 
> extensions to a batch in a specific order.
> +Proposal:+ 
> Add two new APIs to the existing extension points to allow more 
> flexibility for third-party users of Spark: 
>  # Inject an optimizer rule into a batch, in order 
>  # Inject an optimizer batch, in order
> The design spec is here:
> [https://drive.google.com/file/d/1m7rQZ9OZFl0MH5KS12CiIg3upLJSYfsA/view?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28825) Document EXPLAIN Statement in SQL Reference.

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-28825.
--
Fix Version/s: 3.0.0
 Assignee: pavithra ramachandran
   Resolution: Fixed

Resolved by 
[https://github.com/apache/spark/pull/26970|https://github.com/apache/spark/pull/26970#issuecomment-571833889]

> Document EXPLAIN Statement in SQL Reference.
> 
>
> Key: SPARK-28825
> URL: https://issues.apache.org/jira/browse/SPARK-28825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: pavithra ramachandran
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28825) Document EXPLAIN Statement in SQL Reference.

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-28825:
-
Affects Version/s: (was: 2.4.3)
   3.0.0

> Document EXPLAIN Statement in SQL Reference.
> 
>
> Key: SPARK-28825
> URL: https://issues.apache.org/jira/browse/SPARK-28825
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24884) Implement regexp_extract_all

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-24884.
--
Resolution: Won't Fix

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> I need to pull out all values between "MSG:" and "|", which can occur 
> between 1 and n times in each instance. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is arbitrary, 
> and while I can write a UDF to handle this, it would be great if it 
> were supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  
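
The UDF workaround alluded to above could look roughly like the following sketch 
(illustrative only; the column name and pattern are made up for the example):

{code:scala}
import org.apache.spark.sql.functions.{lit, udf}

// Returns every capture-group-1 match of `pattern` found in `text`.
val regexpExtractAll = udf { (text: String, pattern: String) =>
  if (text == null) Seq.empty[String]
  else pattern.r.findAllMatchIn(text).map(_.group(1)).toSeq
}

// Usage sketch: pull out everything between "MSG:" and "|".
// df.select(regexpExtractAll($"block", lit("MSG:([^|]*)\\|")))
{code}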



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24884) Implement regexp_extract_all

2020-01-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010193#comment-17010193
 ] 

Takeshi Yamamuro commented on SPARK-24884:
--

I'll close this because the corresponding PR is inactive (it was automatically 
closed). If necessary, please reopen this. Thanks.

> Implement regexp_extract_all
> 
>
> Key: SPARK-24884
> URL: https://issues.apache.org/jira/browse/SPARK-24884
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Nick Nicolini
>Priority: Major
>
> I've recently hit many cases of regexp parsing where we need to match on 
> something that is always arbitrary in length; for example, a text block that 
> looks something like:
> {code:java}
> AAA:WORDS|
> BBB:TEXT|
> MSG:ASDF|
> MSG:QWER|
> ...
> MSG:ZXCV|{code}
> I need to pull out all values between "MSG:" and "|", which can occur 
> between 1 and n times in each instance. I cannot reliably use the existing 
> {{regexp_extract}} method since the number of occurrences is arbitrary, 
> and while I can write a UDF to handle this, it would be great if it 
> were supported natively in Spark.
> Perhaps we can implement something like {{regexp_extract_all}} as 
> [Presto|https://prestodb.io/docs/current/functions/regexp.html] and 
> [Pig|https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html]
>  have?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30452) Add predict and numFeatures in Python IsotonicRegressionModel

2020-01-07 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-30452:
--

 Summary: Add predict and numFeatures in Python 
IsotonicRegressionModel
 Key: SPARK-30452
 URL: https://issues.apache.org/jira/browse/SPARK-30452
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Since IsotonicRegressionModel doesn't extend JavaPredictionModel, predict and 
numFeatures need to be added explicitly. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29167) Metrics of Analyzer/Optimizer use Scientific counting is not human readable

2020-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-29167.
--
Resolution: Won't Fix

> Metrics of Analyzer/Optimizer use Scientific counting is not human readable
> ---
>
> Key: SPARK-29167
> URL: https://issues.apache.org/jira/browse/SPARK-29167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> The metrics of the Analyzer/Optimizer use scientific notation, which is not human readable:
> !image-2019-09-19-11-36-18-966.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29167) Metrics of Analyzer/Optimizer use Scientific counting is not human readable

2020-01-07 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010187#comment-17010187
 ] 

Takeshi Yamamuro commented on SPARK-29167:
--

I'll close this because no committer strongly supports this. If necessary, 
please reopen this. Thanks.

> Metrics of Analyzer/Optimizer use Scientific counting is not human readable
> ---
>
> Key: SPARK-29167
> URL: https://issues.apache.org/jira/browse/SPARK-29167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Metrics of Analyzer/Optimizer use Scientific counting is not human readable
> !image-2019-09-19-11-36-18-966.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode

2020-01-07 Thread Xingbo Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010156#comment-17010156
 ] 

Xingbo Jiang commented on SPARK-30417:
--

Good catch! `max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)` seems good 
enough for me. Thanks!

> SPARK-29976 calculation of slots wrong for Standalone Mode
> --
>
> Key: SPARK-30417
> URL: https://issues.apache.org/jira/browse/SPARK-30417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> In SPARK-29976 we added a config to determine if we should allow speculation 
> when the number of tasks is less than the number of slots on a single 
> executor. The problem is that for standalone mode (and mesos coarse 
> grained) the EXECUTOR_CORES config is not set properly by default. In those 
> modes the number of executor cores is all the cores of the Worker. The 
> default of EXECUTOR_CORES is 1.
> The calculation:
> val speculationTasksLessEqToSlots = numTasks <= 
>   (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK)
> If someone sets cpus per task > 1 then this would end up being false even 
> with only 1 task. Note that in the default case, where cpus per task is 1 and 
> executor cores is 1, it works out ok, but it is only applied in the case of 
> 1 task vs the number of slots on the executor.
> Here we really don't know the number of executor cores for standalone mode or 
> mesos, so I think a decent solution is to just use 1 in those cases and 
> document the difference.
> Something like 
> max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
>  
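To make the failure mode concrete, a small plain-Python illustration of the check with hypothetical standalone-mode values, comparing the current expression with the suggested max(..., 1):

{code:python}
# EXECUTOR_CORES falls back to its default of 1 in standalone mode, even though
# the executor really gets all of the Worker's cores.
executor_cores = 1   # default value of EXECUTOR_CORES
cpus_per_task = 2    # user set spark.task.cpus=2
num_tasks = 1

slots = executor_cores // cpus_per_task                 # 0
print(num_tasks <= slots)                               # False, even for a single task

slots_fixed = max(executor_cores // cpus_per_task, 1)   # suggested fix
print(num_tasks <= slots_fixed)                         # True
{code}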



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode

2020-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010146#comment-17010146
 ] 

Thomas Graves commented on SPARK-30417:
---

The only way for standalone mode would be to look at what each executor 
registers with. Theoretically different executors could have different number 
of cores.  There are actually other issues (SPARK-30299 for instance) with this 
in the code as well that I think we need a global solution for.  So perhaps for 
this Jira we do the easy thing like I suggested and then we have a 
separate Jira to look at handling this better in the future.

> SPARK-29976 calculation of slots wrong for Standalone Mode
> --
>
> Key: SPARK-30417
> URL: https://issues.apache.org/jira/browse/SPARK-30417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> In SPARK-29976 we added a config to determine if we should allow speculation 
> when the number of tasks is less than the number of slots on a single 
> executor. The problem is that for standalone mode (and mesos coarse 
> grained) the EXECUTOR_CORES config is not set properly by default. In those 
> modes the number of executor cores is all the cores of the Worker. The 
> default of EXECUTOR_CORES is 1.
> The calculation:
> val speculationTasksLessEqToSlots = numTasks <= 
>   (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK)
> If someone sets cpus per task > 1 then this would end up being false even 
> with only 1 task. Note that in the default case, where cpus per task is 1 and 
> executor cores is 1, it works out ok, but it is only applied in the case of 
> 1 task vs the number of slots on the executor.
> Here we really don't know the number of executor cores for standalone mode or 
> mesos, so I think a decent solution is to just use 1 in those cases and 
> document the difference.
> Something like 
> max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30417) SPARK-29976 calculation of slots wrong for Standalone Mode

2020-01-07 Thread Yuchen Huo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010142#comment-17010142
 ] 

Yuchen Huo commented on SPARK-30417:


[~tgraves] Sure. Is there a more stable way to get the number of cores the 
executor is using instead of checking the value of EXECUTOR_CORES which might 
not be set?

 

cc [~jiangxb1987]

> SPARK-29976 calculation of slots wrong for Standalone Mode
> --
>
> Key: SPARK-30417
> URL: https://issues.apache.org/jira/browse/SPARK-30417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> In SPARK-29976 we added a config to determine if we should allow speculation 
> when the number of tasks is less than the number of slots on a single 
> executor. The problem is that for standalone mode (and mesos coarse 
> grained) the EXECUTOR_CORES config is not set properly by default. In those 
> modes the number of executor cores is all the cores of the Worker. The 
> default of EXECUTOR_CORES is 1.
> The calculation:
> val speculationTasksLessEqToSlots = numTasks <= 
>   (conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK)
> If someone sets cpus per task > 1 then this would end up being false even 
> with only 1 task. Note that in the default case, where cpus per task is 1 and 
> executor cores is 1, it works out ok, but it is only applied in the case of 
> 1 task vs the number of slots on the executor.
> Here we really don't know the number of executor cores for standalone mode or 
> mesos, so I think a decent solution is to just use 1 in those cases and 
> document the difference.
> Something like 
> max(conf.get(EXECUTOR_CORES) / sched.CPUS_PER_TASK, 1)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30382) start-thriftserver throws ClassNotFoundException

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30382.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27042
[https://github.com/apache/spark/pull/27042]

> start-thriftserver throws ClassNotFoundException
> 
>
> Key: SPARK-30382
> URL: https://issues.apache.org/jira/browse/SPARK-30382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Minor
> Fix For: 3.0.0
>
>
> start-thriftserver.sh --help throws
> {code}
> .
>  
> Thrift server options:
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/logging/log4j/spi/LoggerContextFactory
>   at org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:167)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:82)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.logging.log4j.spi.LoggerContextFactory
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 3 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30382) start-thriftserver throws ClassNotFoundException

2020-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30382:
-

Assignee: Ajith S

> start-thriftserver throws ClassNotFoundException
> 
>
> Key: SPARK-30382
> URL: https://issues.apache.org/jira/browse/SPARK-30382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Minor
>
> start-thriftserver.sh --help throws
> {code}
> .
>  
> Thrift server options:
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/logging/log4j/spi/LoggerContextFactory
>   at org.apache.hive.service.server.HiveServer2.main(HiveServer2.java:167)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$.main(HiveThriftServer2.scala:82)
>   at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.main(HiveThriftServer2.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.logging.log4j.spi.LoggerContextFactory
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   ... 3 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30451) Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have functions to remove requests

2020-01-07 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30451:
-

 Summary: Stage level Sched: 
ExecutorResourceRequests/TaskResourceRequests should have functions to remove 
requests
 Key: SPARK-30451
 URL: https://issues.apache.org/jira/browse/SPARK-30451
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


Stage level Sched: ExecutorResourceRequests/TaskResourceRequests should have 
functions to remove requests

Currently in the design ExecutorResourceRequests and TaskREsourceRequests are 
mutable and users can update as they want. It would make sense to add api's to 
remove certain resource requirements from them. This would allow a user to 
create one ExecutorResourceRequests object and then if they want to just 
add/remove something from it they easily could without having to recreate all 
the requests in that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30450) Exclude .git folder for python linter

2020-01-07 Thread Yin Huai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-30450:
-
Affects Version/s: (was: 2.4.4)
   3.0.0

> Exclude .git folder for python linter
> -
>
> Key: SPARK-30450
> URL: https://issues.apache.org/jira/browse/SPARK-30450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Minor
>
> The python linter shouldn't include the .git folder. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30450) Exclude .git folder for python linter

2020-01-07 Thread Yin Huai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-30450:
-
Priority: Minor  (was: Major)

> Exclude .git folder for python linter
> -
>
> Key: SPARK-30450
> URL: https://issues.apache.org/jira/browse/SPARK-30450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Minor
>
> The python linter shouldn't include the .git folder. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30450) Exclude .git folder for python linter

2020-01-07 Thread Yin Huai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-30450:


Assignee: Eric Chang

> Exclude .git folder for python linter
> -
>
> Key: SPARK-30450
> URL: https://issues.apache.org/jira/browse/SPARK-30450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
>
> The python linter shouldn't include the .git folder. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30450) Exclude .git folder for python linter

2020-01-07 Thread Eric Chang (Jira)
Eric Chang created SPARK-30450:
--

 Summary: Exclude .git folder for python linter
 Key: SPARK-30450
 URL: https://issues.apache.org/jira/browse/SPARK-30450
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Eric Chang


The python linter shouldn't include the .git folder. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2020-01-07 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009988#comment-17009988
 ] 

Steve Loughran commented on SPARK-2984:
---

bq. As part of your recommendation, is it guaranteed that parquet filenames 
will be unique across jobs?

No idea. The S3A committer defaults to inserting a UUID into the filename to 
meet that guarantee.

bq. Also, when "outputting independently", is it ok to use v2 commit algorithm?

Only if each independent job fails completely when there's a failure/timeout 
during task commit (i.e. do not attempt to commit >1 task attempt for the same 
task). Spark does not currently do that, AFAIK.

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> -- Chen Song at 
> http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html
> {noformat}
> I am running a Spark Streaming job that uses saveAsTextFiles to save results 
> into hdfs files. However, it has an exception after 20 batches
> result-140631234/_temporary/0/task_201407251119__m_03 does not 
> exist.
> {noformat}
> and
> 

[jira] [Commented] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource

2020-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009989#comment-17009989
 ] 

Thomas Graves commented on SPARK-30448:
---

Note this actually overlaps with 
https://issues.apache.org/jira/browse/SPARK-30446 since with this change some 
of those checks don't make sense.

> accelerator aware scheduling enforce cores as limiting resource
> ---
>
> Key: SPARK-30448
> URL: https://issues.apache.org/jira/browse/SPARK-30448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> For the first version of accelerator aware scheduling (SPARK-27495), the SPIP 
> had a condition that we can support dynamic allocation because we were going 
> to have a strict requirement that we don't waste any resources. This means 
> that the number of slots each executor has could be calculated from 
> the number of cores and task cpus just as is done today.
> Somewhere along the line of development we relaxed that and only warn when we 
> are wasting resources. This breaks the dynamic allocation logic if the 
> limiting resource is no longer the cores. This means we will request fewer 
> executors than we really need to run everything.
> We have to enforce that cores is always the limiting resource, so we should 
> throw if it's not.
> I guess we could only make this a requirement with dynamic allocation on, but 
> to make the behavior consistent I would say we just require it across the 
> board.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30442) Write mode ignored when using CodecStreams

2020-01-07 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009963#comment-17009963
 ] 

Maxim Gekk commented on SPARK-30442:


> This can cause issues, particularly with aws tools, that make it impossible 
> to retry.

Could you clarify how it makes retry impossible? When the mode is set to 
overwrite, Spark deletes the entire folder and writes new files, so there 
should be no clashes. In append mode, new files are added; Spark does not 
append to existing files. What's the situation where files should be 
overwritten?
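For reference, the two modes discussed above, using the standard DataFrameWriter API (`df` and the output path are illustrative):

{code:python}
# "overwrite" replaces the whole output directory; "append" adds new files next
# to the existing ones without rewriting them.
df.write.mode("overwrite").json("s3a://bucket/results")
df.write.mode("append").json("s3a://bucket/results")
{code}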

> Write mode ignored when using CodecStreams
> --
>
> Key: SPARK-30442
> URL: https://issues.apache.org/jira/browse/SPARK-30442
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4
>Reporter: Jesse Collins
>Priority: Major
>
> Overwrite is hardcoded to false in the codec stream. This can cause issues, 
> particularly with aws tools, that make it impossible to retry.
> Ideally, this should be read from the write mode set for the DataWriter that 
> is writing through this codec class.
> [https://github.com/apache/spark/blame/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala#L81]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30449) Introducing get_dummies method in pyspark

2020-01-07 Thread Krishna Kumar Tiwari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Kumar Tiwari updated SPARK-30449:
-
Flags: Important

> Introducing get_dummies method in pyspark
> -
>
> Key: SPARK-30449
> URL: https://issues.apache.org/jira/browse/SPARK-30449
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Krishna Kumar Tiwari
>Priority: Major
>
> Introducing a get_dummies method in pyspark, same as in pandas.
> Many times, when using a categorical variable, we want to flatten the data 
> with one-hot encoding to generate columns and fill the matrix; get_dummies is 
> very useful in that scenario.
>  
> The objective here is to introduce get_dummies to pyspark.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30449) Introducing get_dummies method in pyspark

2020-01-07 Thread Krishna Kumar Tiwari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009924#comment-17009924
 ] 

Krishna Kumar Tiwari commented on SPARK-30449:
--

I am already working on this, will share the PR soon. 

> Introducing get_dummies method in pyspark
> -
>
> Key: SPARK-30449
> URL: https://issues.apache.org/jira/browse/SPARK-30449
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Krishna Kumar Tiwari
>Priority: Major
>
> Introducing a get_dummies method in pyspark, same as in pandas.
> Many times, when using a categorical variable, we want to flatten the data 
> with one-hot encoding to generate columns and fill the matrix; get_dummies is 
> very useful in that scenario.
>  
> The objective here is to introduce get_dummies to pyspark.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30449) Introducing get_dummies method in pyspark

2020-01-07 Thread Krishna Kumar Tiwari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Kumar Tiwari updated SPARK-30449:
-
Issue Type: Task  (was: New Feature)

> Introducing get_dummies method in pyspark
> -
>
> Key: SPARK-30449
> URL: https://issues.apache.org/jira/browse/SPARK-30449
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Krishna Kumar Tiwari
>Priority: Major
>
> Introducing a get_dummies method in pyspark, same as in pandas.
> Many times, when using a categorical variable, we want to flatten the data 
> with one-hot encoding to generate columns and fill the matrix; get_dummies is 
> very useful in that scenario.
>  
> The objective here is to introduce get_dummies to pyspark.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30449) Introducing get_dummies method in pyspark

2020-01-07 Thread Krishna Kumar Tiwari (Jira)
Krishna Kumar Tiwari created SPARK-30449:


 Summary: Introducing get_dummies method in pyspark
 Key: SPARK-30449
 URL: https://issues.apache.org/jira/browse/SPARK-30449
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Krishna Kumar Tiwari


Introducing a get_dummies method in pyspark, same as in pandas.

Many times, when using a categorical variable, we want to flatten the data 
with one-hot encoding to generate columns and fill the matrix; get_dummies is 
very useful in that scenario.

 

The objective here is to introduce get_dummies to pyspark.
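A minimal sketch of what a pandas-style get_dummies could look like for a single categorical column in PySpark; this is only an illustration of the idea, not the proposed implementation, and it assumes the column has a modest number of distinct values:

{code:python}
import pyspark.sql.functions as fx

def get_dummies(df, column):
    """Add one 0/1 indicator column per distinct value of `column`."""
    categories = [row[0] for row in df.select(column).distinct().collect()]
    for cat in categories:
        df = df.withColumn("{}_{}".format(column, cat),
                           fx.when(fx.col(column) == cat, 1).otherwise(0))
    return df
{code}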

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource

2020-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009901#comment-17009901
 ] 

Thomas Graves commented on SPARK-30448:
---

note there are other calculations throughout spark code that calculate the 
number of slots so I think its best for now just to require cores to be 
limiting resource

> accelerator aware scheduling enforce cores as limiting resource
> ---
>
> Key: SPARK-30448
> URL: https://issues.apache.org/jira/browse/SPARK-30448
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> For the first version of accelerator aware scheduling (SPARK-27495), the SPIP 
> had a condition that we can support dynamic allocation because we were going 
> to have a strict requirement that we don't waste any resources. This means 
> that the number of slots each executor has could be calculated from 
> the number of cores and task cpus just as is done today.
> Somewhere along the line of development we relaxed that and only warn when we 
> are wasting resources. This breaks the dynamic allocation logic if the 
> limiting resource is no longer the cores. This means we will request fewer 
> executors than we really need to run everything.
> We have to enforce that cores is always the limiting resource, so we should 
> throw if it's not.
> I guess we could only make this a requirement with dynamic allocation on, but 
> to make the behavior consistent I would say we just require it across the 
> board.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases

2020-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009896#comment-17009896
 ] 

Thomas Graves commented on SPARK-30446:
---

working on this

> Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
> --
>
> Key: SPARK-30446
> URL: https://issues.apache.org/jira/browse/SPARK-30446
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> With accelerator aware scheduling, SparkContext.checkResourcesPerTask tries to 
> make sure that users have configured things properly and warns or errors if not.
> It doesn't properly handle all cases, like warning if cpu resources are being 
> wasted. We should test this better and handle those cases.
> I fixed these in the stage level scheduling work, but I'm not sure of the 
> timeline on getting that in, so we may want to fix this separately as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30448) accelerator aware scheduling enforce cores as limiting resource

2020-01-07 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30448:
-

 Summary: accelerator aware scheduling enforce cores as limiting 
resource
 Key: SPARK-30448
 URL: https://issues.apache.org/jira/browse/SPARK-30448
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


For the first version of accelerator aware scheduling (SPARK-27495), the SPIP 
had a condition that we can support dynamic allocation because we were going to 
have a strict requirement that we don't waste any resources. This means that 
the number of slots each executor has could be calculated from the 
number of cores and task cpus just as is done today.

Somewhere along the line of development we relaxed that and only warn when we 
are wasting resources. This breaks the dynamic allocation logic if the limiting 
resource is no longer the cores. This means we will request fewer executors 
than we really need to run everything.

We have to enforce that cores is always the limiting resource, so we should 
throw if it's not.

I guess we could only make this a requirement with dynamic allocation on, but 
to make the behavior consistent I would say we just require it across the board.
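A small plain-Python illustration, with hypothetical numbers, of why dynamic allocation under-requests executors when cores are not the limiting resource:

{code:python}
# Hypothetical executor and task resource profile.
executor_cores, executor_gpus = 8, 2
task_cpus, task_gpus = 1, 1

slots_by_cores = executor_cores // task_cpus  # 8 slots, what the code assumes
slots_by_gpus = executor_gpus // task_gpus    # 2 slots, the real per-executor limit

pending_tasks = 16
print(pending_tasks // slots_by_cores)  # 2 executors requested
print(pending_tasks // slots_by_gpus)   # 8 executors actually needed
{code}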



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands

2020-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30039:
---

Assignee: Pablo Langa Blanco

>  CREATE FUNCTION should look up catalog/table like v2 commands
> --
>
> Key: SPARK-30039
> URL: https://issues.apache.org/jira/browse/SPARK-30039
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Assignee: Pablo Langa Blanco
>Priority: Major
>
>  CREATE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30039) CREATE FUNCTION should look up catalog/table like v2 commands

2020-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30039.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26890
[https://github.com/apache/spark/pull/26890]

>  CREATE FUNCTION should look up catalog/table like v2 commands
> --
>
> Key: SPARK-30039
> URL: https://issues.apache.org/jira/browse/SPARK-30039
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Pablo Langa Blanco
>Assignee: Pablo Langa Blanco
>Priority: Major
> Fix For: 3.0.0
>
>
>  CREATE FUNCTION should look up catalog/table like v2 commands



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()

2020-01-07 Thread Luke Richter (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Richter updated SPARK-30443:
-
Description: 
Our Spark code is causing a "Managed memory leak detected" warning to appear, 
even though we are not calling take() or limit().

According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 
managed memory leaks should only be caused by not reading an iterator to 
completion, i.e. take() or limit()

Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
memory leak detected; size = 2097152 bytes, TID = 118"
 The size of the managed memory leak is always 2MB.

I have created a minimal test program that reproduces the warning: 
{code:java}
import pyspark.sql
import pyspark.sql.functions as fx


def main():
builder = pyspark.sql.SparkSession.builder
builder = builder.appName("spark-jira")
spark = builder.getOrCreate()

reader = spark.read
reader = reader.format("csv")
reader = reader.option("inferSchema", "true")
reader = reader.option("header", "true")

table_c = reader.load("c.csv")
table_a = reader.load("a.csv")
table_b = reader.load("b.csv")

primary_filter = fx.col("some_code").isNull()

new_primary_data = table_a.filter(primary_filter)

new_ids = new_primary_data.select("some_id")

new_data = table_b.join(new_ids, "some_id")

new_data = new_data.select("some_id")
result = table_c.join(new_data, "some_id", "left")

result.repartition(1).write.json("results.json", mode="overwrite")

spark.stop()


if __name__ == "__main__":
main()
{code}
Our code isn't anything out of the ordinary, just some filters, selects and 
joins.

The input data is made up of 3 CSV files. The input data files are quite large, 
roughly 2.6GB in total uncompressed. I attempted to reduce the number of rows 
in the CSV input files but this caused the warning to no longer appear. After 
compressing the files I was able to attach them below.

  was:
Our Spark code is causing a "Managed memory leak detected" warning to appear, 
even though we are not calling take() or limit().


 According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 
managed memory leaks should only be caused by not reading an iterator to 
completion, i.e. take() or limit()

Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
memory leak detected; size = 2097152 bytes, TID = 118"
 The size of the managed memory leak is always 2MB.

I have created a minimal test program that reproduces the warning: 
{code:java}
import pyspark.sql
import pyspark.sql.functions as fx


def main():
builder = pyspark.sql.SparkSession.builder
builder = builder.appName("spark-jira")
spark = builder.getOrCreate()

reader = spark.read
reader = reader.format("csv")
reader = reader.option("inferSchema", "true")
reader = reader.option("header", "true")

table_c = reader.load("c.csv")
table_a = reader.load("a.csv")
table_b = reader.load("b.csv")

primary_filter = fx.col("some_code").isNull()

new_primary_data = table_a.filter(primary_filter)

new_ids = new_primary_data.select("some_id")

new_data = table_b.join(new_ids, "some_id")

new_data = new_data.select("some_id")
result = table_c.join(new_data, "some_id", "left")

result.repartition(1).write.json("results.json", mode="overwrite")

spark.stop()


if __name__ == "__main__":
main()
{code}

 Our code isn't anything out of the ordinary, just some filters, selects and 
joins.

The input data is made up of 3 CSV files. The input data files are quite large, 
roughly 2.6GB in total uncompressed. I attempted to reduce the number of rows 
in the CSV input files but this caused the warning to no longer appear. What is 
the best way to get these test data files that reproduce the warning into your 
hands?


> "Managed memory leak detected" even with no calls to take() or limit()
> --
>
> Key: SPARK-30443
> URL: https://issues.apache.org/jira/browse/SPARK-30443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.4
>Reporter: Luke Richter
>Priority: Major
> Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, 
> even though we are not calling take() or limit().
> According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 
> managed memory leaks should only be caused by not reading an iterator to 
> completion, i.e. take() or limit()
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
> memory leak detected; size = 2097152 bytes, TID = 118"
>  The size of the managed memory leak is always 2MB.
> I have 

[jira] [Updated] (SPARK-30443) "Managed memory leak detected" even with no calls to take() or limit()

2020-01-07 Thread Luke Richter (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Richter updated SPARK-30443:
-
Attachment: a.csv.zip
b.csv.zip
c.csv.zip

> "Managed memory leak detected" even with no calls to take() or limit()
> --
>
> Key: SPARK-30443
> URL: https://issues.apache.org/jira/browse/SPARK-30443
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.4
>Reporter: Luke Richter
>Priority: Major
> Attachments: a.csv.zip, b.csv.zip, c.csv.zip
>
>
> Our Spark code is causing a "Managed memory leak detected" warning to appear, 
> even though we are not calling take() or limit().
>  According to SPARK-14168 https://issues.apache.org/jira/browse/SPARK-14168 
> managed memory leaks should only be caused by not reading an iterator to 
> completion, i.e. take() or limit()
> Our exact warning text is: "2020-01-06 14:54:59 WARN Executor:66 - Managed 
> memory leak detected; size = 2097152 bytes, TID = 118"
>  The size of the managed memory leak is always 2MB.
> I have created a minimal test program that reproduces the warning: 
> {code:java}
> import pyspark.sql
> import pyspark.sql.functions as fx
> def main():
> builder = pyspark.sql.SparkSession.builder
> builder = builder.appName("spark-jira")
> spark = builder.getOrCreate()
> reader = spark.read
> reader = reader.format("csv")
> reader = reader.option("inferSchema", "true")
> reader = reader.option("header", "true")
> table_c = reader.load("c.csv")
> table_a = reader.load("a.csv")
> table_b = reader.load("b.csv")
> primary_filter = fx.col("some_code").isNull()
> new_primary_data = table_a.filter(primary_filter)
> new_ids = new_primary_data.select("some_id")
> new_data = table_b.join(new_ids, "some_id")
> new_data = new_data.select("some_id")
> result = table_c.join(new_data, "some_id", "left")
> result.repartition(1).write.json("results.json", mode="overwrite")
> spark.stop()
> if __name__ == "__main__":
> main()
> {code}
>  Our code isn't anything out of the ordinary, just some filters, selects and 
> joins.
> The input data is made up of 3 CSV files. The input data files are quite 
> large, roughly 2.6GB in total uncompressed. I attempted to reduce the number 
> of rows in the CSV input files but this caused the warning to no longer 
> appear. What is the best way to get these test data files that reproduce the 
> warning into your hands?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30437) Uneven spaces for some fields in EXPLAIN FORMATTED

2020-01-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30437.
--
Resolution: Won't Fix

> Uneven spaces for some fields in EXPLAIN FORMATTED
> --
>
> Key: SPARK-30437
> URL: https://issues.apache.org/jira/browse/SPARK-30437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Aman Omer
>Priority: Minor
>
> Output of EXPLAIN EXTENDED has uneven spaces. E.g.,
> {code:java}
> (4) Project [codegen id : 1]
> Output: [key#x, val#x]
> Input : [key#x, val#x]
>  
> (5) HashAggregate [codegen id : 1]
> Input: [key#x, val#x]
> {code}
> Unlike the Input field for HashAggregate, the Output and Input fields of 
> Project have extra padding spaces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27604) Enhance constant and constraint propagation

2020-01-07 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-27604:
---
Issue Type: Improvement  (was: Bug)

> Enhance constant and constraint propagation
> ---
>
> Key: SPARK-27604
> URL: https://issues.apache.org/jira/browse/SPARK-27604
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Major
>
> There is some room for improvement as constant propagation could allow 
> substitution of deterministic expressions (instead of attributes only) to 
> constants and substitutions in other than equal predicates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27604) Enhance constant and constraint propagation

2020-01-07 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-27604:
---
Description: There is some room for improvement as constant propagation 
could allow substitution of deterministic expressions (instead of attributes 
only) to constants and substitutions in other than equal predicates.  (was: 
There is a bug in constant propagation due to null handling:

{{SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1)}} returns those rows where 
{{c}} is null due to {{1 + 1 = 1}} propagation

There is some room for improvement as constant propagation could allow 
substitution of deterministic expressions (instead of attributes only) to 
constants and substitutions in other than equal predicates.)

> Enhance constant and constraint propagation
> ---
>
> Key: SPARK-27604
> URL: https://issues.apache.org/jira/browse/SPARK-27604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Major
>
> There is some room for improvement as constant propagation could allow 
> substitution of deterministic expressions (instead of attributes only) to 
> constants and substitutions in other than equal predicates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30447) Constant propagation nullability issue

2020-01-07 Thread Peter Toth (Jira)
Peter Toth created SPARK-30447:
--

 Summary: Constant propagation nullability issue
 Key: SPARK-30447
 URL: https://issues.apache.org/jira/browse/SPARK-30447
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


There is a bug in constant propagation due to null handling:

SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1) returns those rows where c is 
null due to 1 + 1 = 1 propagation, but it shouldn't.
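A reproduction sketch in PySpark (the table and data are illustrative, assuming an active SparkSession named spark):

{code:python}
df = spark.createDataFrame([(None,), (1,), (2,)], "c int")
df.createOrReplaceTempView("t")
spark.sql("SELECT * FROM t WHERE NOT(c = 1 AND c + 1 = 1)").show()
# Correct result: the rows with c = 1 and c = 2 only; the NULL row evaluates
# the predicate to NULL and must be filtered out. With the faulty propagation
# the NULL row is returned as well.
{code}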



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases

2020-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009830#comment-17009830
 ] 

Thomas Graves commented on SPARK-30446:
---

Yeah, so running on standalone, if you set spark.task.cpus=2 (or anything > 1) 
and you don't set executor cores, it fails even though it shouldn't, because 
executor cores default to all the cores of the worker:

 

20/01/07 09:34:02 ERROR Main: Failed to initialize Spark session.
org.apache.spark.SparkException: The number of cores per executor (=1) has to 
be >= the task config: spark.task.cpus = 2 when run on spark://tomg-x299:7077.

> Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
> --
>
> Key: SPARK-30446
> URL: https://issues.apache.org/jira/browse/SPARK-30446
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> With accelerator aware scheduling, SparkContext.checkResourcesPerTask tries to 
> make sure that users have configured things properly and warns or errors if not.
> It doesn't properly handle all cases, like warning if cpu resources are being 
> wasted. We should test this better and handle those cases.
> I fixed these in the stage level scheduling work, but I'm not sure of the 
> timeline on getting that in, so we may want to fix this separately as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30431) Update SqlBase.g4 to create commentSpec pattern as same as locationSpec

2020-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30431.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27102
[https://github.com/apache/spark/pull/27102]

> Update SqlBase.g4 to create commentSpec pattern as same as locationSpec
> ---
>
> Key: SPARK-30431
> URL: https://issues.apache.org/jira/browse/SPARK-30431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.0.0
>
>
> In `SqlBase.g4`, the comment clause is used as `COMMENT comment=STRING` and 
> `COMMENT STRING` in many places.
> While the location clause often appears along with the comment clause with a 
> pattern defined as 
> {code:sql}
> locationSpec
> : LOCATION STRING
> ;
> {code}
> Then, we have to visit locationSpec as a List but comment as a single token



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30431) Update SqlBase.g4 to create commentSpec pattern as same as locationSpec

2020-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30431:
---

Assignee: Kent Yao

> Update SqlBase.g4 to create commentSpec pattern as same as locationSpec
> ---
>
> Key: SPARK-30431
> URL: https://issues.apache.org/jira/browse/SPARK-30431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
>
> In `SqlBase.g4`, the comment clause is used as `COMMENT comment=STRING` and 
> `COMMENT STRING` in many places.
> While the location clause often appears along with the comment clause with a 
> pattern defined as 
> {code:sql}
> locationSpec
> : LOCATION STRING
> ;
> {code}
> Then, we have to visit locationSpec as a List but comment as a single token



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases

2020-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009742#comment-17009742
 ] 

Thomas Graves commented on SPARK-30446:
---

I think there may also be issues in it with standalone mode since Executor 
cores isn't necessarily right, but I would have to test again to verify that.

> Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases
> --
>
> Key: SPARK-30446
> URL: https://issues.apache.org/jira/browse/SPARK-30446
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> With accelerator aware scheduling, SparkContext.checkResourcesPerTask tries to 
> make sure that users have configured things properly and warns or errors if not.
> It doesn't properly handle all cases, like warning if cpu resources are being 
> wasted. We should test this better and handle those cases.
> I fixed these in the stage level scheduling work, but I'm not sure of the 
> timeline on getting that in, so we may want to fix this separately as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30446) Accelerator aware scheduling checkResourcesPerTask doesn't cover all cases

2020-01-07 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30446:
-

 Summary: Accelerator aware scheduling checkResourcesPerTask 
doesn't cover all cases
 Key: SPARK-30446
 URL: https://issues.apache.org/jira/browse/SPARK-30446
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


With accelerator aware scheduling, SparkContext.checkResourcesPerTask tries to 
make sure that users have configured things properly and warns or errors if not.

It doesn't properly handle all cases, like warning if cpu resources are being 
wasted. We should test this better and handle those cases.

I fixed these in the stage level scheduling work, but I'm not sure of the 
timeline on getting that in, so we may want to fix this separately as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30445) Accelerator aware scheduling handle setting configs to 0 better

2020-01-07 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30445:
-

 Summary: Accelerator aware scheduling handle setting configs to 0 
better
 Key: SPARK-30445
 URL: https://issues.apache.org/jira/browse/SPARK-30445
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


If you set the resource configs to 0, it errors with a divide by zero. While I 
think ideally the user should just remove the configs, we should handle the 0 
better.

 

{code}
$ spark-submit --conf spark.driver.resource.gpu.amount=0 \
    --conf spark.executor.resource.gpu.amount=0 \
    --conf spark.task.resource.gpu.amount=0 \
    --conf spark.driver.resource.gpu.discoveryScript=/shared/tools/get_gpu_resources.sh \
    --conf spark.executor.resource.gpu.discoveryScript=/shared/tools/get_gpu_resources.sh \
    test.py
20/01/07 05:36:42 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/07 05:36:43 INFO SparkContext: Running Spark version 3.0.0-preview
20/01/07 05:36:43 INFO ResourceUtils: ==
20/01/07 05:36:43 INFO ResourceUtils: Resources for spark.driver:
gpu -> [name: gpu, addresses: 0]
20/01/07 05:36:43 INFO ResourceUtils: ==
20/01/07 05:36:43 INFO SparkContext: Submitted application: test.py
..
20/01/07 05:36:43 ERROR SparkContext: Error initializing SparkContext.
java.lang.ArithmeticException: / by zero
    at org.apache.spark.SparkContext$.$anonfun$createTaskScheduler$3(SparkContext.scala:2793)
    at org.apache.spark.SparkContext$.$anonfun$createTaskScheduler$3$adapted(SparkContext.scala:2775)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30338) Avoid unnecessary InternalRow copies in ParquetRowConverter

2020-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30338.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26993
[https://github.com/apache/spark/pull/26993]

> Avoid unnecessary InternalRow copies in ParquetRowConverter
> ---
>
> Key: SPARK-30338
> URL: https://issues.apache.org/jira/browse/SPARK-30338
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 3.0.0
>
>
> ParquetRowConverter calls {{InternalRow.copy()}} in cases where the copy is 
> unnecessary; this can severely harm performance when reading deeply-nested 
> Parquet.
> It looks like this copying was originally added to handle arrays and maps of 
> structs (in which case we need to keep the copying), but we can omit it for 
> the more common case of structs nested directly in structs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30429) WideSchemaBenchmark fails with OOM

2020-01-07 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17009468#comment-17009468
 ] 

L. C. Hsieh commented on SPARK-30429:
-

Thanks for pinging me. Looking into this.

> WideSchemaBenchmark fails with OOM
> --
>
> Key: SPARK-30429
> URL: https://issues.apache.org/jira/browse/SPARK-30429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
> Attachments: WideSchemaBenchmark_console.txt
>
>
> Run WideSchemaBenchmark on the master (commit 
> bc16bb1dd095c9e1c8deabf6ac0d528441a81d88) via:
> {code}
> SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
> org.apache.spark.sql.execution.benchmark.WideSchemaBenchmark"
> {code}
> This fails with:
> {code}
> Caused by: java.lang.reflect.InvocationTargetException
> [error]   at 
> sun.reflect.GeneratedConstructorAccessor8.newInstance(Unknown Source)
> [error]   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> [error]   at 
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$7(TreeNode.scala:468)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> [error]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$makeCopy$1(TreeNode.scala:467)
> [error]   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
> [error]   ... 132 more
> [error] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [error]   at java.util.Arrays.copyOfRange(Arrays.java:3664)
> [error]   at java.lang.String.(String.java:207)
> [error]   at java.lang.StringBuilder.toString(StringBuilder.java:407)
> [error]   at 
> org.apache.spark.sql.types.StructType.catalogString(StructType.scala:411)
> [error]   at 
> org.apache.spark.sql.types.StructType.$anonfun$catalogString$1(StructType.scala:410)
> [error]   at 
> org.apache.spark.sql.types.StructType$$Lambda$2441/1040526643.apply(Unknown 
> Source)
> {code}
> Full stack dump is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org