[jira] [Commented] (SPARK-28836) Improve canonicalize API

2019-08-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913973#comment-16913973
 ] 

Dongjoon Hyun commented on SPARK-28836:
---

The content of this issue was swapped with that of SPARK-28835.

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.






[jira] [Resolved] (SPARK-28835) Introduce TPCDSSchema

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28835.
---
Fix Version/s: 3.0.0
 Assignee: Ali Afroozeh
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25535

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28835
> URL: https://issues.apache.org/jira/browse/SPARK-28835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes






[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28836:
--
Description: This PR improves the `canonicalize` API by removing the method 
`def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` 
and taking care of normalizing expressions in `QueryPlan`.  (was: This PR 
extracts the schema information of TPCDS tables into a separate class called 
`TPCDSSchema` which can be reused for other testing purposes)
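As a rough illustration of the direction described in the new description (a hypothetical, simplified sketch with made-up types, not Spark's actual PlanExpression/QueryPlan classes), the normalization that each expression used to perform via canonicalize(attrs) can instead be driven by the plan, which already knows its own output attributes:

{code:scala}
// Hypothetical, simplified types for illustration only.
case class Attr(name: String, id: Long)

sealed trait Expr
case class Ref(id: Long) extends Expr       // reference to an attribute by expression id
case class Ordinal(pos: Int) extends Expr   // normalized, position-based reference
case class Add(left: Expr, right: Expr) extends Expr

object PlanNormalizer {
  // The plan rewrites id-based references into ordinals against its own output,
  // so individual expressions no longer need their own canonicalize(attrs) method.
  def normalize(e: Expr, output: Seq[Attr]): Expr = e match {
    case Ref(id) =>
      val pos = output.indexWhere(_.id == id)
      if (pos >= 0) Ordinal(pos) else Ref(id)
    case Add(l, r) => Add(normalize(l, output), normalize(r, output))
    case other => other
  }
}
{code}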

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.






[jira] [Updated] (SPARK-28835) Introduce TPCDSSchema

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28835:
--
Description: This PR extracts the schema information of TPCDS tables into a 
separate class called `TPCDSSchema` which can be reused for other testing 
purposes  (was: This PR improves the `canonicalize` API by removing the method 
`def canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` 
and taking care of normalizing expressions in `QueryPlan`.)
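A minimal sketch of the idea behind the original description, using hypothetical names (this is not the actual Spark trait): the TPC-DS table definitions live in one reusable place instead of inside a single benchmark suite.

{code:scala}
// Hypothetical sketch: table name -> column definitions as a DDL-style string.
trait TPCDSSchemaLike {
  def tableColumns: Map[String, String] = Map(
    "call_center" -> "cc_call_center_sk INT, cc_call_center_id STRING",
    "item"        -> "i_item_sk INT, i_item_id STRING, i_current_price DECIMAL(7,2)"
  )
}

// Any benchmark or test can mix the trait in and build the DDL it needs.
object TpcdsDdl extends TPCDSSchemaLike {
  def createTableSql(name: String): String =
    s"CREATE TABLE $name (${tableColumns(name)}) USING parquet"
}
{code}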

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28835
> URL: https://issues.apache.org/jira/browse/SPARK-28835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes






[jira] [Updated] (SPARK-28836) Improve canonicalize API

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28836:
--
Summary: Improve canonicalize API  (was: Introduce TPCDSSchema)

> Improve canonicalize API
> 
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes






[jira] [Updated] (SPARK-28835) Introduce TPCDSSchema

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28835:
--
Summary: Introduce TPCDSSchema  (was: Improve canonicalize API)

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28835
> URL: https://issues.apache.org/jira/browse/SPARK-28835
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR improves the `canonicalize` API by removing the method `def 
> canonicalize(attrs: AttributeSeq): PlanExpression[T]` in `PlanExpression` and 
> taking care of normalizing expressions in `QueryPlan`.






[jira] [Reopened] (SPARK-28836) Introduce TPCDSSchema

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-28836:
---

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes






[jira] [Commented] (SPARK-28836) Introduce TPCDSSchema

2019-08-22 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913971#comment-16913971
 ] 

Dongjoon Hyun commented on SPARK-28836:
---

Oops. Sorry, [~hyukjin.kwon]. I merged this because the PR was opened with the wrong 
JIRA id.

- https://github.com/apache/spark/pull/25535

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes






[jira] [Resolved] (SPARK-28319) DataSourceV2: Support SHOW TABLES

2019-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28319.
-
Fix Version/s: 3.0.0
 Assignee: Terry Kim
   Resolution: Fixed

> DataSourceV2: Support SHOW TABLES
> -
>
> Key: SPARK-28319
> URL: https://issues.apache.org/jira/browse/SPARK-28319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> SHOW TABLES needs to support v2 catalogs.






[jira] [Assigned] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-22 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-28025:


Assignee: Jungtaek Lim

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  
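A minimal sketch of the kind of cleanup being asked for here, assuming a Hadoop FileSystem is on the classpath (illustrative only, not the actual patch; the object and method names are made up): after committing a checkpoint file, also remove the local-filesystem checksum sidecar so it does not accumulate.

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

object CrcCleanup {
  // Hadoop's local ChecksumFileSystem names the sidecar ".<fileName>.crc".
  def deleteChecksumSidecar(fs: FileSystem, committedFile: Path): Unit = {
    val crc = new Path(committedFile.getParent, s".${committedFile.getName}.crc")
    if (fs.exists(crc)) {
      // Best effort: a leaked sidecar only wastes space, so ignore the return value.
      fs.delete(crc, false)
    }
  }
}
{code}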






[jira] [Resolved] (SPARK-28025) HDFSBackedStateStoreProvider should not leak .crc files

2019-08-22 Thread Shixiong Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-28025.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

> HDFSBackedStateStoreProvider should not leak .crc files 
> 
>
> Key: SPARK-28025
> URL: https://issues.apache.org/jira/browse/SPARK-28025
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
> Environment: Spark 2.4.3
> Kubernetes 1.11(?) (OpenShift)
> StateStore storage on a mounted PVC. Viewed as a local filesystem by the 
> `FileContextBasedCheckpointFileManager` : 
> {noformat}
> scala> glusterfm.isLocal
> res17: Boolean = true{noformat}
>Reporter: Gerard Maas
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> The HDFSBackedStateStoreProvider when using the default CheckpointFileManager 
> is leaving '.crc' files behind. There's a .crc file created for each 
> `atomicFile` operation of the CheckpointFileManager.
> Over time, the number of files becomes very large. It makes the state store 
> file system constantly increase in size and, in our case, deteriorates the 
> file system performance.
> Here's a sample of one of our spark storage volumes after 2 days of execution 
> (4 stateful streaming jobs, each on a different sub-dir):
>  # 
> {noformat}
> Total files in PVC (used for checkpoints and state store)
> $find . | wc -l
> 431796
> # .crc files
> $find . -name "*.crc" | wc -l
> 418053{noformat}
> With each .crc file taking one storage block, the used storage runs into the 
> GBs of data.
> These jobs are running on Kubernetes. Our shared storage provider, GlusterFS, 
> shows serious performance deterioration with this large number of files:
> {noformat}
> DEBUG HDFSBackedStateStoreProvider: fetchFiles() took 29164ms{noformat}
>  






[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-22 Thread hemanth meka (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913965#comment-16913965
 ] 

hemanth meka commented on SPARK-23519:
--

I have a fix for this. checkColumnNameDuplication is checking the analyzed 
schema (id, id) whereas it should be checking the aliased schema (int1, int2). I got 
it to work. I will run tests and submit a PR.
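A hedged sketch of the check described above (illustrative only, not the actual Spark code path; the object name is made up): validate duplicates against the user-supplied aliases rather than the analyzed child output.

{code:scala}
object ViewColumnCheck {
  // Fails for Seq("col1", "col1") but passes for the aliased Seq("int1", "int2").
  def assertNoDuplicates(userSpecifiedCols: Seq[String]): Unit = {
    val dups = userSpecifiedCols
      .groupBy(_.toLowerCase)
      .collect { case (name, occurrences) if occurrences.size > 1 => name }
    require(dups.isEmpty,
      s"The view output contains duplicate column name(s): ${dups.mkString(", ")}")
  }
}
{code}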

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)






[jira] [Resolved] (SPARK-28730) Configurable type coercion policy for table insertion

2019-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28730.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25453
[https://github.com/apache/spark/pull/25453]

> Configurable type coercion policy for table insertion
> -
>
> Key: SPARK-28730
> URL: https://issues.apache.org/jira/browse/SPARK-28730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> After all the discussions in the dev list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
>  
> Here I propose that we can make the store assignment rules in the analyzer 
> configurable, and the behavior of V1 and V2 should be consistent.
> When inserting a value into a column with a different data type, Spark will 
> perform type coercion. After this PR, we support 2 policies for the type 
> coercion rules: 
> legacy and strict. 
> 1. With legacy policy, Spark allows casting any value to any data type and 
> null result is returned when the conversion is invalid. The legacy policy is 
> the only behavior in Spark 2.x and it is compatible with Hive. 
> 2. With strict policy, Spark doesn't allow any possible precision loss or 
> data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` 
> are not allowed.
> Eventually, the "legacy" mode will be removed, so it is disallowed in data 
> source V2.
> To ensure backward compatibility with existing queries, the default store 
> assignment policy for data source V1 is "legacy" before ANSI mode is 
> implemented.
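A minimal sketch of the two policies described above, with illustrative names only (this is not Spark's actual config or analyzer code): "legacy" allows any cast and produces null for invalid conversions at runtime, while "strict" rejects lossy casts.

{code:scala}
sealed trait StoreAssignmentPolicy
case object Legacy extends StoreAssignmentPolicy
case object Strict extends StoreAssignmentPolicy

object TypeCoercionCheck {
  // Whether writing a value of type `from` into a column of type `to` is allowed.
  def canWrite(from: String, to: String, policy: StoreAssignmentPolicy): Boolean = policy match {
    case Legacy => true // anything is castable; invalid values become null at runtime
    case Strict =>
      val safeWidening = Set(("int", "long"), ("int", "double"), ("float", "double"))
      from == to || safeWidening.contains((from, to))
  }
}
{code}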






[jira] [Assigned] (SPARK-28730) Configurable type coercion policy for table insertion

2019-08-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28730:
---

Assignee: Gengliang Wang

> Configurable type coercion policy for table insertion
> -
>
> Key: SPARK-28730
> URL: https://issues.apache.org/jira/browse/SPARK-28730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>
> After all the discussions in the dev list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
>  
> Here I propose that we can make the store assignment rules in the analyzer 
> configurable, and the behavior of V1 and V2 should be consistent.
> When inserting a value into a column with a different data type, Spark will 
> perform type coercion. After this PR, we support 2 policies for the type 
> coercion rules: 
> legacy and strict. 
> 1. With legacy policy, Spark allows casting any value to any data type and 
> null result is returned when the conversion is invalid. The legacy policy is 
> the only behavior in Spark 2.x and it is compatible with Hive. 
> 2. With strict policy, Spark doesn't allow any possible precision loss or 
> data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` 
> are not allowed.
> Eventually, the "legacy" mode will be removed, so it is disallowed in data 
> source V2.
> To ensure backward compatibility with existing queries, the default store 
> assignment policy for data source V1 is "legacy" before ANSI mode is 
> implemented.






[jira] [Created] (SPARK-28857) Clean up the comments of PR template during merging

2019-08-22 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-28857:
-

 Summary: Clean up the comments of PR template during merging
 Key: SPARK-28857
 URL: https://issues.apache.org/jira/browse/SPARK-28857
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun









[jira] [Assigned] (SPARK-28702) Display useful error message (instead of NPE) for invalid Dataset operations (e.g. calling actions inside of transformations)

2019-08-22 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-28702:
--

Assignee: Shivu Sondur

> Display useful error message (instead of NPE) for invalid Dataset operations 
> (e.g. calling actions inside of transformations)
> -
>
> Key: SPARK-28702
> URL: https://issues.apache.org/jira/browse/SPARK-28702
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Shivu Sondur
>Priority: Major
>
> In Spark, SparkContext and SparkSession can only be used on the driver, not 
> on executors. For example, this means that you cannot call 
> {{someDataset.collect()}} inside of a Dataset or RDD transformation.
> When Spark serializes RDDs and Datasets, references to SparkContext and 
> SparkSession are null'ed out (by being marked as {{@transient}} or via the 
> Closure Cleaner). As a result, RDD and Dataset methods which reference use 
> these driver-side-only objects (e.g. actions or transformations) will see 
> {{null}} references and may fail with a {{NullPointerException}}. For 
> example, in code which (via a chain of calls) tried to {{collect()}} a 
> dataset inside of a Dataset.map operation:
> {code:java}Caused by: java.lang.NullPointerException
> at 
> $apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
> at 
> $apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
> at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
> [...] {code}
> The resulting NPE can be _very_ confusing to users.
> In SPARK-5063 I added some logic to throw clearer error messages when 
> performing similar invalid actions on RDDs. This ticket's scope is to 
> implement similar logic for Datasets.
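A small, hedged illustration of the failure mode described above, assuming a local Spark build on the classpath (the object and variable names are made up, and nothing here is the ticket's actual fix): the invalid pattern nests an action inside a transformation, and the usual workaround runs the driver-side action first.

{code:scala}
import org.apache.spark.sql.SparkSession

object NestedActionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("nested-action-demo").getOrCreate()
    import spark.implicits._

    val lookup = spark.range(0, 10)
    val data   = spark.range(0, 5)

    // Invalid: count()/collect() inside a transformation runs on executors,
    // where the Dataset's SparkSession reference has been nulled out.
    // data.map(_ => lookup.count()).show()

    // Valid: run the driver-side action first, then close over the plain value.
    val lookupCount = lookup.count()
    data.map(_ => lookupCount).show()

    spark.stop()
  }
}
{code}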






[jira] [Resolved] (SPARK-28702) Display useful error message (instead of NPE) for invalid Dataset operations (e.g. calling actions inside of transformations)

2019-08-22 Thread Josh Rosen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-28702.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25503
[https://github.com/apache/spark/pull/25503]

> Display useful error message (instead of NPE) for invalid Dataset operations 
> (e.g. calling actions inside of transformations)
> -
>
> Key: SPARK-28702
> URL: https://issues.apache.org/jira/browse/SPARK-28702
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Assignee: Shivu Sondur
>Priority: Major
> Fix For: 3.0.0
>
>
> In Spark, SparkContext and SparkSession can only be used on the driver, not 
> on executors. For example, this means that you cannot call 
> {{someDataset.collect()}} inside of a Dataset or RDD transformation.
> When Spark serializes RDDs and Datasets, references to SparkContext and 
> SparkSession are null'ed out (by being marked as {{@transient}} or via the 
> Closure Cleaner). As a result, RDD and Dataset methods which reference use 
> these driver-side-only objects (e.g. actions or transformations) will see 
> {{null}} references and may fail with a {{NullPointerException}}. For 
> example, in code which (via a chain of calls) tried to {{collect()}} a 
> dataset inside of a Dataset.map operation:
> {code:java}Caused by: java.lang.NullPointerException
> at 
> $apache$spark$sql$Dataset$$rddQueryExecution$lzycompute(Dataset.scala:3027)
> at 
> $apache$spark$sql$Dataset$$rddQueryExecution(Dataset.scala:3025)
> at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3038)
> at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3036)
> [...] {code}
> The resulting NPE can be _very_ confusing to users.
> In SPARK-5063 I added some logic to throw clearer error messages when 
> performing similar invalid actions on RDDs. This ticket's scope is to 
> implement similar logic for Datasets.






[jira] [Resolved] (SPARK-28832) Document SHOW SCHEMAS statement in SQL Reference.

2019-08-22 Thread jobit mathew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jobit mathew resolved SPARK-28832.
--
Resolution: Duplicate

Closing the JIRA as it will be covered as part of the SHOW DATABASES statement 
itself: https://issues.apache.org/jira/browse/SPARK-28807

> Document SHOW SCHEMAS statement in SQL Reference.
> -
>
> Key: SPARK-28832
> URL: https://issues.apache.org/jira/browse/SPARK-28832
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>







[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-22 Thread Liang-Chi Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913915#comment-16913915
 ] 

Liang-Chi Hsieh commented on SPARK-23519:
-

Thanks for pinging me.

I am going on a flight soon. If this is not urgent, I can look into it after 
today.

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)






[jira] [Commented] (SPARK-28827) Document SELECT CURRENT_DATABASE in SQL Reference

2019-08-22 Thread Shivu Sondur (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913903#comment-16913903
 ] 

Shivu Sondur commented on SPARK-28827:
--

I will work on this.

> Document SELECT CURRENT_DATABASE in SQL Reference
> -
>
> Key: SPARK-28827
> URL: https://issues.apache.org/jira/browse/SPARK-28827
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>







[jira] [Commented] (SPARK-28823) Document CREATE ROLE Statement

2019-08-22 Thread Shivu Sondur (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913902#comment-16913902
 ] 

Shivu Sondur commented on SPARK-28823:
--

I will work on this.

> Document CREATE ROLE Statement 
> ---
>
> Key: SPARK-28823
> URL: https://issues.apache.org/jira/browse/SPARK-28823
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>







[jira] [Commented] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-08-22 Thread jiangyu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913879#comment-16913879
 ] 

jiangyu commented on SPARK-28482:
-

Hi [~bryanc], I have tested toPandas() and it is okay. The row count is correct 
and no exception is thrown.

I originally used df_result.rdd.foreachPartition(trigger_func) to trigger the 
pandas udf with Python 2.7, and everything was fine. After changing to Python 
3.6, this method seemed unstable. I will change the method to toPandas(). 
Thank you.

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method 
> when using a udf. However, an issue arises with python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in python 3.x. At first, I thought the processing logic might be wrong, 
> so I changed the code to a very simple one and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   produce 100,000 rows and name the file test.csv ,upload to hdfs, then load 
> it , and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4.execute it in pyspark (local or yarn)
>  run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before the total row number is 100, it should print "iterator 
> one time " 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !py2.7.png!   
>  The result is right, 10 times of print.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by jvm-side code which we 
> added; I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” is added in the worker.py.
>  !worker.png!
>  In order to get the exception, we changed the spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding this code to 
> print the exception.
>   
>  
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils extends Logging {
>  case t: Throwable =>
>  // Purposefully not using NonFatal, because even fatal exceptions
>  // we don't want to have our finallyBlock suppress
> + logInfo(t.getLocalizedMessage)
> + t.printStackTrace()
>  originalThrowable = t
>  throw originalThrowable
>  } finally {{code}
>  
>  
>  It seems pyspark gets the data from the jvm, but pyarrow receives the data 
> incomplete. The pyarrow side thinks the data is finished and then shuts down the 
> socket. At the same time, the jvm side still writes to the same socket, but 
> gets a socket-closed exception.
>  The pyarrow part is in ipc.pxi:
>   
> {code:java}
> cdef class _RecordBatchReader:
>  cdef:
>  shared_ptr[CRecordBatchReader] reader
>  shared_ptr[InputStream] in_stream
> cdef readonly:
>  Schema schema
> def _cinit_(self):
>  pass
> def _open(self, source):
>  get_input_stream(source, &self.in_stream)
>  with nogil:
>  check_status(CRecordBatchStreamReader.Open(
>  self.in_stream.get(), &self.reader))
> self.schema = pyarrow_wrap_schema(self.reader.get().schema())
> def _iter

[jira] [Commented] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-22 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913855#comment-16913855
 ] 

Wenchen Fan commented on SPARK-23519:
-

I think this is a bug and should be fixed. cc [~viirya] do you have any clues 
about this bug?

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)






[jira] [Commented] (SPARK-27594) spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be read incorrectly

2019-08-22 Thread Owen O'Malley (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913665#comment-16913665
 ] 

Owen O'Malley commented on SPARK-27594:
---

This is being caused by an ORC bug that was backported into the Hortonworks 
version of ORC.

> spark.sql.orc.enableVectorizedReader causes milliseconds in Timestamp to be 
> read incorrectly
> 
>
> Key: SPARK-27594
> URL: https://issues.apache.org/jira/browse/SPARK-27594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jan-Willem van der Sijp
>Priority: Major
>
> Using {{spark.sql.orc.impl=native}} and 
> {{spark.sql.orc.enableVectorizedReader=true}} causes reading of TIMESTAMP 
> columns in HIVE stored as ORC to be interpreted incorrectly. Specifically, 
> the milliseconds of the timestamp will be doubled.
> Input/output of a Zeppelin session to demonstrate:
> {code:python}
> %pyspark
> from pprint import pprint
> spark.conf.set("spark.sql.orc.impl", "native")
> spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
> pprint(spark.sparkContext.getConf().getAll())
> 
> [('sql.stacktrace', 'false'),
>  ('spark.eventLog.enabled', 'true'),
>  ('spark.app.id', 'application_1556200632329_0005'),
>  ('importImplicit', 'true'),
>  ('printREPLOutput', 'true'),
>  ('spark.history.ui.port', '18081'),
>  ('spark.driver.extraLibraryPath',
>   
> '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('spark.driver.extraJavaOptions',
>   ' -Dfile.encoding=UTF-8 '
>   
> '-Dlog4j.configuration=file:///usr/hdp/current/zeppelin-server/conf/log4j.properties
>  '
>   
> '-Dzeppelin.log.file=/var/log/zeppelin/zeppelin-interpreter-spark2-spark-zeppelin-sandbox-hdp.hortonworks.com.log'),
>  ('concurrentSQL', 'false'),
>  ('spark.driver.port', '40195'),
>  ('spark.executor.extraLibraryPath',
>   
> '/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64'),
>  ('useHiveContext', 'true'),
>  ('spark.jars',
>   
> 'file:/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.history.provider',
>   'org.apache.spark.deploy.history.FsHistoryProvider'),
>  ('spark.yarn.historyServer.address', 'sandbox-hdp.hortonworks.com:18081'),
>  ('spark.submit.deployMode', 'client'),
>  ('spark.ui.filters',
>   'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'),
>  
> ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS',
>   'sandbox-hdp.hortonworks.com'),
>  ('spark.eventLog.dir', 'hdfs:///spark2-history/'),
>  ('spark.repl.class.uri', 
> 'spark://sandbox-hdp.hortonworks.com:40195/classes'),
>  ('spark.driver.host', 'sandbox-hdp.hortonworks.com'),
>  ('master', 'yarn'),
>  ('spark.yarn.dist.archives',
>   '/usr/hdp/current/spark2-client/R/lib/sparkr.zip#sparkr'),
>  ('spark.scheduler.mode', 'FAIR'),
>  ('spark.yarn.queue', 'default'),
>  ('spark.history.kerberos.keytab',
>   '/etc/security/keytabs/spark.headless.keytab'),
>  ('spark.executor.id', 'driver'),
>  ('spark.history.fs.logDirectory', 'hdfs:///spark2-history/'),
>  ('spark.history.kerberos.enabled', 'false'),
>  ('spark.master', 'yarn'),
>  ('spark.sql.catalogImplementation', 'hive'),
>  ('spark.history.kerberos.principal', 'none'),
>  ('spark.driver.extraClassPath',
>   
> ':/usr/hdp/current/zeppelin-server/interpreter/spark/*:/usr/hdp/current/zeppelin-server/lib/interpreter/*::/usr/hdp/current/zeppelin-server/interpreter/spark/zeppelin-spark_2.11-0.7.3.2.6.5.0-292.jar'),
>  ('spark.driver.appUIAddress', 'http://sandbox-hdp.hortonworks.com:4040'),
>  ('spark.repl.class.outputDir',
>   '/tmp/spark-555b2143-0efa-45c1-aecc-53810f89aa5f'),
>  ('spark.yarn.isPython', 'true'),
>  ('spark.app.name', 'Zeppelin'),
>  
> ('spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES',
>   
> 'http://sandbox-hdp.hortonworks.com:8088/proxy/application_1556200632329_0005'),
>  ('maxResult', '1000'),
>  ('spark.executorEnv.PYTHONPATH',
>   
> '/usr/hdp/current/spark2-client//python/lib/py4j-0.10.6-src.zip:/usr/hdp/current/spark2-client//python/:/usr/hdp/current/spark2-client//python:/usr/hdp/current/spark2-client//python/lib/py4j-0.8.2.1-src.zip{{PWD}}/pyspark.zip{{PWD}}/py4j-0.10.6-src.zip'),
>  ('spark.ui.proxyBase', '/proxy/application_1556200632329_0005')]
> {code}
> {code:python}
> %pyspark
> spark.sql("""
> DROP TABLE IF EXISTS default.hivetest
> """)
> spark.sql("""
> CREATE TABLE default.hivetest (
> day DATE,
> time TIMESTAMP,
> timestring STRING
> )
> USING ORC
> """)
> {code}
> {code:python}
> %pyspark
> df1 = spark.createDataFrame(
> [
> ("2019-01-01", 

[jira] [Assigned] (SPARK-28769) Improve warning message in Barrier Execution Mode in case required slots > maximum slots

2019-08-22 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28769:
-

Assignee: Kousuke Saruta

> Improve warning message in Barrier Execution Mode in case required slots > 
> maximum slots
> 
>
> Key: SPARK-28769
> URL: https://issues.apache.org/jira/browse/SPARK-28769
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> In the current implementation of Barrier Execution Mode, if required slots > 
> maximum slots, we get following warning messages.
> {code}
> 19/08/18 15:18:09 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> 19/08/18 15:18:24 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> 19/08/18 15:18:39 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> 19/08/18 15:18:54 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> ...
> {code}
> If we can provide more information, it might help users to decide what they 
> should do.
> The following messages are one example.
> {code}
> 19/08/18 16:52:23 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently.
> 19/08/18 16:52:38 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently (Retry 1/3 failed).
> 19/08/18 16:52:53 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently (Retry 2/3 failed).
> 19/08/18 16:53:08 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently (Retry 3/3 failed).
> {code}






[jira] [Resolved] (SPARK-28769) Improve warning message in Barrier Execution Mode in case required slots > maximum slots

2019-08-22 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28769.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25487
[https://github.com/apache/spark/pull/25487]

> Improve warning message in Barrier Execution Mode in case required slots > 
> maximum slots
> 
>
> Key: SPARK-28769
> URL: https://issues.apache.org/jira/browse/SPARK-28769
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.0
>
>
> In the current implementation of Barrier Execution Mode, if required slots > 
> maximum slots, we get following warning messages.
> {code}
> 19/08/18 15:18:09 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> 19/08/18 15:18:24 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> 19/08/18 15:18:39 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> 19/08/18 15:18:54 WARN DAGScheduler: The job 2 requires to run a barrier 
> stage that requires more slots than the total number of slots in the cluster 
> currently.
> ...
> {code}
> If we can provide more information, it might help users to decide what they 
> should do.
> The following messages are one example.
> {code}
> 19/08/18 16:52:23 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently.
> 19/08/18 16:52:38 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently (Retry 1/3 failed).
> 19/08/18 16:52:53 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently (Retry 2/3 failed).
> 19/08/18 16:53:08 WARN DAGScheduler: The job 0 requires to run a barrier 
> stage that requires 3 slots than the total number of slots(2) in the cluster 
> currently (Retry 3/3 failed).
> {code}






[jira] [Commented] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-08-22 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913611#comment-16913611
 ] 

Bryan Cutler commented on SPARK-28482:
--

I'm not really sure what you are doing above. Are you saying the row count is 
not correct using unmodified Spark? I used 120,000 rows and it is correct for 
me:

{code}
In [1]: from pyspark.sql.types import * 
   ...: from pyspark.sql.functions import * 
   ...: df=spark.read.format('csv').option("header","true").load('test.csv') 
   ...: df=df.select(*(col(c).cast("int").alias(c) for c in df.columns)) 
   ...: df=df.repartition(1) 
   ...: def add_func(a,b,c,d,e,f,g): 
   ...: print('iterator one time') 
   ...: return a 
   ...: add = pandas_udf(add_func, returnType=IntegerType()) 
   ...: 
df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g")))
 

In [2]: r = df_result.toPandas()
 
[Stage 2:>  (0 + 1) / 
1]iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time
iterator one time

In [3]: len(r)  
 
Out[3]: 12
{code}

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method 
> when using a udf. However, an issue arises with python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in python 3.x. At first, I thought the processing logic might be wrong, 
> so I changed the code to a very simple one and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data has seven columns, a, b, c, d, e, f and g; the data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   produce 100,000 rows and name the file test.csv ,upload to hdfs, then load 
> it , and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4.execute it in pyspark (local or yarn)
>  run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before the total row number is 100, it should print "iterator 
> one time " 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores 1{code}
>  
>  !py2.7.png!   
>  The result is right, 10 times of print.
>  
>  
> (2)Python 3.5 or 3.6 envs:
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/python3.6/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark.executor.pyspark.memory=2g --conf 
> spark.sql.execution.arrow.enabled=true --executor-cores{code}
>  
> !py3.6.png!
> The data is incomplete. The exception is printed by jvm-side code which we 
> added; I will explain it later.
>   
>   
> h3. *Investigation*
> The “process done” is added in the worker.py.
>  !worker.png!
>  In order to get the exception, we changed the spark code under 
> core/src/main/scala/org/apache/spark/util/Utils.scala, adding this code to 
> print the exception.
>   
>  
> {code:java}
> @@ -1362,6 +1362,8 @@ private[spark] object Utils 

[jira] [Commented] (SPARK-28832) Document SHOW SCHEMAS statement in SQL Reference.

2019-08-22 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913560#comment-16913560
 ] 

Dilip Biswal commented on SPARK-28832:
--

[~jobitmathew] Thanks. Yeah, it will be documented as part of SHOW DATABASES. 
Please review [https://github.com/apache/spark/pull/25526] and let me know if 
you want anything changed.

> Document SHOW SCHEMAS statement in SQL Reference.
> -
>
> Key: SPARK-28832
> URL: https://issues.apache.org/jira/browse/SPARK-28832
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>







[jira] [Closed] (SPARK-28846) Set OMP_NUM_THREADS to executor cores for python

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-28846.
-

> Set OMP_NUM_THREADS to executor cores for python
> 
>
> Key: SPARK-28846
> URL: https://issues.apache.org/jira/browse/SPARK-28846
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-28846) Set OMP_NUM_THREADS to executor cores for python

2019-08-22 Thread Ryan Blue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-28846.
---
Resolution: Duplicate

> Set OMP_NUM_THREADS to executor cores for python
> 
>
> Key: SPARK-28846
> URL: https://issues.apache.org/jira/browse/SPARK-28846
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-28856) DataSourceV2: Support SHOW DATABASES

2019-08-22 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913426#comment-16913426
 ] 

Terry Kim commented on SPARK-28856:
---

I will work on this.

> DataSourceV2: Support SHOW DATABASES
> 
>
> Key: SPARK-28856
> URL: https://issues.apache.org/jira/browse/SPARK-28856
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> SHOW DATABASES needs to support v2 catalogs.






[jira] [Created] (SPARK-28856) DataSourceV2: Support SHOW DATABASES

2019-08-22 Thread Terry Kim (Jira)
Terry Kim created SPARK-28856:
-

 Summary: DataSourceV2: Support SHOW DATABASES
 Key: SPARK-28856
 URL: https://issues.apache.org/jira/browse/SPARK-28856
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim


SHOW DATABASES needs to support v2 catalogs.






[jira] [Updated] (SPARK-28577) Ensure executorMemoryHead requested value not less than MEMORY_OFFHEAP_SIZE when MEMORY_OFFHEAP_ENABLED is true

2019-08-22 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-28577:
--
Docs Text: On YARN, the off-heap memory size is now separately included in the 
container size Spark requests from YARN. Previously you had to add it to the 
overhead memory you requested; that is no longer needed.
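A hedged sketch of the sizing rule being described (illustrative only, not the actual YARN allocation code; the object and parameter names are made up): when off-heap memory is enabled, the container request must also cover the configured off-heap size.

{code:scala}
object ContainerSizing {
  // Total memory to request from the resource manager for one executor, in MiB.
  def containerMemoryMiB(executorMemoryMiB: Long,
                         overheadMiB: Long,
                         offHeapEnabled: Boolean,
                         offHeapSizeMiB: Long): Long = {
    val offHeap = if (offHeapEnabled) offHeapSizeMiB else 0L
    executorMemoryMiB + overheadMiB + offHeap
  }
}
{code}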

> Ensure executorMemoryHead requested value not less than MEMORY_OFFHEAP_SIZE 
> when MEMORY_OFFHEAP_ENABLED is true
> ---
>
> Key: SPARK-28577
> URL: https://issues.apache.org/jira/browse/SPARK-28577
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: release-notes
>
> If MEMORY_OFFHEAP_ENABLED is true, we should ensure that the requested executor 
> memory overhead is not less than MEMORY_OFFHEAP_SIZE; otherwise the memory 
> requested for the executor may not be enough.
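
For illustration only (the sizes below are made up and not from this ticket), the settings involved relate roughly like this:

{code:scala}
import org.apache.spark.SparkConf

// Hedged sketch: when off-heap memory is enabled, the requested executor memory
// overhead should be at least MEMORY_OFFHEAP_SIZE, or the YARN container may be
// undersized for what the executor actually uses.
val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "2g")
  // Before this change, users had to fold the off-heap size into the overhead manually:
  .set("spark.executor.memoryOverhead", "3g") // >= off-heap size plus the usual overhead
{code}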



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28577) Ensure executorMemoryHead requested value not less than MEMORY_OFFHEAP_SIZE when MEMORY_OFFHEAP_ENABLED is true

2019-08-22 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-28577:
--
Labels: release-notes  (was: )

> Ensure executorMemoryHead requested value not less than MEMORY_OFFHEAP_SIZE 
> when MEMORY_OFFHEAP_ENABLED is true
> ---
>
> Key: SPARK-28577
> URL: https://issues.apache.org/jira/browse/SPARK-28577
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Priority: Major
>  Labels: release-notes
>
> If MEMORY_OFFHEAP_ENABLED is true, we should ensure that the requested executor 
> memory overhead is not less than MEMORY_OFFHEAP_SIZE; otherwise the memory 
> requested for the executor may not be enough.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-22 Thread Hao Yang Ang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Yang Ang updated SPARK-28854:
-
Description: 
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).collect.foreach(println)

warning: there was one feature warning; re-run with -feature for details

19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.util.NoSuchElementException: next on empty iterator

 

 

Workaround - implement zip with mapping to tuple:

scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)

(2,1)

(4,2)

(6,3)

 

  was:
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).foreach(println)

warning: there was one feature warning; re-run with -feature for details

19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.util.NoSuchElementException: next on empty iterator

 

 

Workaround - implement zip with mapping to tuple:

scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)

(2,1)

(4,2)

(6,3)

 


> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).collect.foreach(println)
> warning: there was one feature warning; re-run with -feature for details
> 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
> java.util.NoSuchElementException: next on empty iterator
>  
>  
> Workaround - implement zip with mapping to tuple:
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
>  
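
A short sketch of why this fails and one more way around it (this assumes a running SparkContext named `sc`, e.g. in spark-shell; only the tuple workaround above comes from the report):

{code:scala}
// An Iterator can be traversed only once: xs.map(2*) and xs share the same underlying
// iterator, so zipping them interleaves reads from a single cursor and runs off the end.
val rdd = sc.parallelize(Seq(1, 2, 3))

// Workaround from the report: build the pair directly from each element.
rdd.mapPartitions(xs => xs.map(x => (x * 2, x))).collect().foreach(println)

// Alternative: duplicate the iterator so each side gets its own cursor.
rdd.mapPartitions { xs =>
  val (left, right) = xs.duplicate
  left.map(_ * 2).zip(right)
}.collect().foreach(println)
{code}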



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-08-22 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-13677.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25383
[https://github.com/apache/spark/pull/25383]

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 3.0.0
>
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook ([http://www.herbrich.me/papers/adclicksfacebook.pdf]) and is 
> implemented in well-known libraries:
> sklearn 
> [apply|http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]
> xgboost 
> [predict_leaf_index|https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]
> lightgbm 
> [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]
> catboost 
> [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]
>  
> Referring to the design of the above implementations, I propose the following API:
> val model1: DecisionTreeClassificationModel = ...
> model1.setLeafCol("leaves")
>  model1.transform(df)
>  
> val model2: GBTClassificationModel = ...
> model2.getLeafCol
>  model2.transform(df)
>  
>  The detailed design doc: 
> [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13677) Support Tree-Based Feature Transformation for ML

2019-08-22 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-13677:
-

Assignee: zhengruifeng

> Support Tree-Based Feature Transformation for ML
> 
>
> Key: SPARK-13677
> URL: https://issues.apache.org/jira/browse/SPARK-13677
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> It would be nice to be able to use RF and GBT for feature transformation:
>  First fit an ensemble of trees (like RF, GBT or other TreeEnsembleModels) on 
> the training set. Then each leaf of each tree in the ensemble is assigned a 
> fixed arbitrary feature index in a new feature space. These leaf indices are 
> then encoded in a one-hot fashion.
> This method was first introduced by 
> Facebook ([http://www.herbrich.me/papers/adclicksfacebook.pdf]) and is 
> implemented in well-known libraries:
> sklearn 
> [apply|http://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html#example-ensemble-plot-feature-transformation-py]
> xgboost 
> [predict_leaf_index|https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py]
> lightgbm 
> [predict_leaf_index|https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict_leaf_index]
> catboost 
> [calc_leaf_index|https://github.com/catboost/tutorials/tree/master/leaf_indexes_calculation]
>  
> Referring to the design of the above implementations, I propose the following API:
> val model1: DecisionTreeClassificationModel = ...
> model1.setLeafCol("leaves")
>  model1.transform(df)
>  
> val model2: GBTClassificationModel = ...
> model2.getLeafCol
>  model2.transform(df)
>  
>  The detailed design doc: 
> [https://docs.google.com/document/d/1d81qS0zfb6vqbt3dn6zFQUmWeh2ymoRALvhzPpTZqvo/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28776) SparkML MLWriter gets hadoop conf from spark context instead of session

2019-08-22 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28776.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25505
[https://github.com/apache/spark/pull/25505]

> SparkML MLWriter gets hadoop conf from spark context instead of session
> ---
>
> Key: SPARK-28776
> URL: https://issues.apache.org/jira/browse/SPARK-28776
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Helen Yu
>Assignee: Helen Yu
>Priority: Minor
> Fix For: 3.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In handleOverwrite of MLWriter, the hadoop configuration of the spark context 
> is used where as the hadoop configuration of the spark session's session 
> state should be used instead. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L677]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28776) SparkML MLWriter gets hadoop conf from spark context instead of session

2019-08-22 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-28776:
-

Assignee: Helen Yu

> SparkML MLWriter gets hadoop conf from spark context instead of session
> ---
>
> Key: SPARK-28776
> URL: https://issues.apache.org/jira/browse/SPARK-28776
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.4.3
>Reporter: Helen Yu
>Assignee: Helen Yu
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In handleOverwrite of MLWriter, the hadoop configuration of the spark context 
> is used where as the hadoop configuration of the spark session's session 
> state should be used instead. 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala#L677]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28855) Remove outdated Experimental, Evolving annotations

2019-08-22 Thread Sean Owen (Jira)
Sean Owen created SPARK-28855:
-

 Summary: Remove outdated Experimental, Evolving annotations
 Key: SPARK-28855
 URL: https://issues.apache.org/jira/browse/SPARK-28855
 Project: Spark
  Issue Type: Bug
  Components: ML, Spark Core, SQL, Structured Streaming
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


The Experimental and Evolving annotations are both (like Unstable) used to 
express that an API may change. However, there are many things in the code 
that have been marked that way since even Spark 1.x. Per the dev@ thread, 
anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it 
would not change without a deprecation cycle.

Therefore I'd like to remove most of these annotations, leaving them for things 
that are obviously inherently experimental (ExperimentalMethods), or recently 
added and still legitimately experimental (DSv2, Barrier mode).



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-22 Thread Hao Yang Ang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Yang Ang updated SPARK-28854:
-
Description: 
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).foreach(println)

warning: there was one feature warning; re-run with -feature for details

19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.util.NoSuchElementException: next on empty iterator

 

 

Workaround - implement zip with mapping to tuple:

scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)

(2,1)

(4,2)

(6,3)

 

  was:
scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).foreach(println)

warning: there was one feature warning; re-run with -feature for details

19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.util.NoSuchElementException: next on empty iterator




Workaround - implement zip with mapping to tuple:


scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)

(2,1)

(4,2)

(6,3)





 


> Zipping iterators in mapPartitions will fail
> 
>
> Key: SPARK-28854
> URL: https://issues.apache.org/jira/browse/SPARK-28854
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Hao Yang Ang
>Priority: Minor
>
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
> xs.map(2*).zip(xs)).foreach(println)
> warning: there was one feature warning; re-run with -feature for details
> 19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
> java.util.NoSuchElementException: next on empty iterator
>  
>  
> Workaround - implement zip with mapping to tuple:
> scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
> x))).collect.foreach(println)
> (2,1)
> (4,2)
> (6,3)
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28854) Zipping iterators in mapPartitions will fail

2019-08-22 Thread Hao Yang Ang (Jira)
Hao Yang Ang created SPARK-28854:


 Summary: Zipping iterators in mapPartitions will fail
 Key: SPARK-28854
 URL: https://issues.apache.org/jira/browse/SPARK-28854
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Hao Yang Ang


scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => 
xs.map(2*).zip(xs)).foreach(println)

warning: there was one feature warning; re-run with -feature for details

19/08/22 21:13:18 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)

java.util.NoSuchElementException: next on empty iterator




Workaround - implement zip with mapping to tuple:


scala> sc.parallelize(Seq(1, 2, 3)).mapPartitions(xs => xs.map(x => (x * 2, 
x))).collect.foreach(println)

(2,1)

(4,2)

(6,3)





 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28512:
---
Description: 
In popular DBMSs like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown on 
casting failures, e.g. cast('abc' as int).
While in Spark, the result is silently converted to null. This is by design, since 
we don't want a long-running job aborted by some casting failure. But there are 
scenarios where users want to make sure all the data conversions are correct, 
the way they are in MySQL/PostgreSQL/Oracle.

This one has a bigger scope than 
https://issues.apache.org/jira/browse/SPARK-28741

  was:
In popular DBMS like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown on 
casting, e.g. cast('abc' as int) 
While in Spark, the result is converted as null silently. It is by design since 
we don't want a long-running job aborted by some casting failure. But there are 
scenarios that users want to make sure all the data conversion are correct, 
like the way they use MySQL/PostgreSQL/Oracle.

If the changes touch too much code, we can limit the new optional mode to table 
insertion first. By default the new behavior is disabled.


> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMSs like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown 
> on casting failures, e.g. cast('abc' as int).
> While in Spark, the result is silently converted to null. This is by design, 
> since we don't want a long-running job aborted by some casting failure. But 
> there are scenarios where users want to make sure all the data conversions are 
> correct, the way they are in MySQL/PostgreSQL/Oracle.
> This one has a bigger scope than 
> https://issues.apache.org/jira/browse/SPARK-28741
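
For illustration (this assumes a `spark` session as in spark-shell; the strict mode itself is what this ticket proposes, so it is shown only as a comment):

{code:scala}
// Current default behavior: an invalid cast silently yields NULL instead of failing.
spark.sql("SELECT CAST('abc' AS INT)").show()   // prints a single null row today

// The proposal: under an opt-in ANSI-style mode, the same query would raise a
// runtime exception, matching MySQL/PostgreSQL/Oracle behavior.
{code}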



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28853) Support conf to organize filePartitions by file path

2019-08-22 Thread ZhangYao (Jira)
ZhangYao created SPARK-28853:


 Summary:  Support conf to organize filePartitions by file path
 Key: SPARK-28853
 URL: https://issues.apache.org/jira/browse/SPARK-28853
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.3
Reporter: ZhangYao


Dynamically writing data to HDFS may generate a lot of small files, so 
sometimes we need to merge those files. When reading these files and writing 
them again, it would be helpful if the RDD partitions of the read files 
followed the partition layout on HDFS.

Currently, in FileSourceScanExec.createNonBucketedReadRDD, after splitting files 
Spark sorts them by file size, which may scatter the partition distribution of 
the data files. It would be a great help to support sorting by file path 
here :)
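
A hedged illustration of the difference (names are simplified and made up; this is not Spark's actual code):

{code:scala}
// Minimal sketch: split ordering decides which files end up in the same read partition.
case class FileSplit(path: String, length: Long)

val splits = Seq(
  FileSplit("/warehouse/t/part-00000", 10L * 1024 * 1024),
  FileSplit("/warehouse/t/part-00001", 90L * 1024 * 1024),
  FileSplit("/warehouse/t/part-00002", 20L * 1024 * 1024))

// Today: splits are ordered by size (largest first) before being packed into partitions.
val bySize = splits.sortBy(s => -s.length)

// Proposal: optionally order by path so files that sit together on HDFS stay together
// in the resulting read partitions.
val byPath = splits.sortBy(_.path)
{code}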



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28741) New optional mode: Throw exceptions when casting to integers causes overflow

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28741:
---
Summary: New optional mode: Throw exceptions when casting to integers 
causes overflow  (was: Throw exceptions when casting to integers causes 
overflow)

> New optional mode: Throw exceptions when casting to integers causes overflow
> 
>
> Key: SPARK-28741
> URL: https://issues.apache.org/jira/browse/SPARK-28741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To follow ANSI SQL, we should support a configurable mode that throws 
> exceptions when casting to integers causes overflow.
> The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, 
> which throws exceptions on arithmetical operation overflow.
> To unify it, the configuration is renamed from 
> "spark.sql.arithmeticOperations.failOnOverFlow" to 
> "spark.sql.failOnIntegerOverFlow"



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28503) Return null result on cast an out-of-range value to a integral type

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-28503.

Resolution: Won't Fix

After consideration, I decided to close this one and open 
https://issues.apache.org/jira/browse/SPARK-28741. The current behavior is 
actually compatible with Hive. The changes in this PR might break existing 
queries, while there is no similar behavior in other DBMSs.
If users care about the overflow, they can enable the configuration 
proposed in https://issues.apache.org/jira/browse/SPARK-28741.

> Return null result on cast an out-of-range value to a integral type
> ---
>
> Key: SPARK-28503
> URL: https://issues.apache.org/jira/browse/SPARK-28503
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> Currently, when we convert an out-of-range value to a numeric type, the value 
> is unexpected
> scala> spark.sql("select cast(327689 as short)").show()
> +------------------------+
> |CAST(327689 AS SMALLINT)|
> +------------------------+
> |                       9|
> +------------------------+
> The result is actually 327689.toShort  (327689 & 0x).
> For such cases, I think we should return null.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28741) Throw exceptions when casting to integers causes overflow

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28741:
---
Parent: SPARK-28589
Issue Type: Sub-task  (was: New Feature)

> Throw exceptions when casting to integers causes overflow
> -
>
> Key: SPARK-28741
> URL: https://issues.apache.org/jira/browse/SPARK-28741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To follow ANSI SQL, we should support a configurable mode that throws 
> exceptions when casting to integers causes overflow.
> The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, 
> which throws exceptions on arithmetical operation overflow.
> To unify it, the configuration is renamed from 
> "spark.sql.arithmeticOperations.failOnOverFlow" to 
> "spark.sql.failOnIntegerOverFlow"



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28741) Throw exceptions when casting to integers causes overflow

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28741:
---
Parent: (was: SPARK-26217)
Issue Type: New Feature  (was: Sub-task)

> Throw exceptions when casting to integers causes overflow
> -
>
> Key: SPARK-28741
> URL: https://issues.apache.org/jira/browse/SPARK-28741
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> To follow ANSI SQL, we should support a configurable mode that throws 
> exceptions when casting to integers causes overflow.
> The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, 
> which throws exceptions on arithmetical operation overflow.
> To unify it, the configuration is renamed from 
> "spark.sql.arithmeticOperations.failOnOverFlow" to 
> "spark.sql.failOnIntegerOverFlow"



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28852) Implement GetCatalogsOperation

2019-08-22 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-28852:
---

 Summary: Implement GetCatalogsOperation
 Key: SPARK-28852
 URL: https://issues.apache.org/jira/browse/SPARK-28852
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang






--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28852) Implement GetCatalogsOperation for Thrift Server

2019-08-22 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28852:

Summary: Implement GetCatalogsOperation for Thrift Server  (was: Implement 
GetCatalogsOperation)

> Implement GetCatalogsOperation for Thrift Server
> 
>
> Key: SPARK-28852
> URL: https://issues.apache.org/jira/browse/SPARK-28852
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28848) insert overwrite local directory stored as parquet does not create snappy.parquet data file at local directory path

2019-08-22 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S resolved SPARK-28848.
-
Resolution: Duplicate

Will be fixed as part of SPARK-28659

> insert overwrite local directory stored as parquet does not create 
> snappy.parquet data file at local directory path
> ---
>
> Key: SPARK-28848
> URL: https://issues.apache.org/jira/browse/SPARK-28848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/func> insert overwrite local directory 
> '/opt/trash4/' stored as parquet select * from trash1 a where a.country='PAK';
> +---------+--+
> | Result  |
> +---------+--+
> +---------+--+
> No rows selected (1.368 seconds)
> {code}
> Data file at local directory path:
> {code}
> vm1:/opt/trash4 # ll
> total 12
> -rw-r--r-- 1 root root   8 Aug 22 14:30 ._SUCCESS.crc
> -rw-r--r-- 1 root root  16 Aug 22 14:30 
> .part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000.crc
> -rw-r--r-- 1 root root   0 Aug 22 14:30 _SUCCESS
> -rw-r--r-- 1 root root 619 Aug 22 14:30 
> part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28495) Introduce ANSI store assignment policy for table insertion

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28495:
---
Description: 
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast 
the data type of the input query to the data type of the target table by coercion. 
This can be super confusing, e.g. users make a mistake and write string values to 
an int column.

In data source V2, by default, only upcasting is allowed when inserting data 
into a table. E.g. int -> long and int -> string are allowed, while decimal -> 
double or long -> int are not allowed. The rules of UpCast were originally 
created for Dataset type coercion. They are quite strict and different from the 
behavior of all existing popular DBMSs. This is a breaking change. It is possible 
that existing queries will be broken after the 3.0 release.

Following the ANSI SQL standard is the most proper solution, as the community 
voted in the dev list. 
For more details, see the discussion on 
http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562
 and https://github.com/apache/spark/pull/25453 .

This task is to add the ANSI store assignment policy as the default option of the 
configuration "spark.sql.storeAssignmentPolicy".

  was:
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast 
the data type of input query to the data type of target table by coercion. This 
can be super confusing, e.g. users make a mistake and write string values to an 
int column.

In data source V2,  by default, only upcasting is allowed when inserting data 
into a table. E.g. int -> long and int -> string are allowed, while decimal -> 
double or long -> int are not allowed. The rules of UpCast was originally 
created for Dataset type coercion. They are quite strict and different from the 
behavior of all existing popular DBMS. This is breaking change. It is possible 
that it would hurt some Spark users after 3.0 releases.

This PR proposes that we can follow the rules of store assignment(section 9.2) 
in ANSI SQL. Two significant differences from Up-Cast:
1. Any numeric type can be assigned to another numeric type.
2. TimestampType can be assigned DateType

The new behavior is consistent with PostgreSQL. It is more explainable and 
acceptable than using UpCast .





> Introduce ANSI store assignment policy for table insertion
> --
>
> Key: SPARK-28495
> URL: https://issues.apache.org/jira/browse/SPARK-28495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In Spark version 2.4 and earlier, when inserting into a table, Spark will 
> cast the data type of the input query to the data type of the target table by 
> coercion. This can be super confusing, e.g. users make a mistake and write 
> string values to an int column.
> In data source V2, by default, only upcasting is allowed when inserting data 
> into a table. E.g. int -> long and int -> string are allowed, while decimal 
> -> double or long -> int are not allowed. The rules of UpCast were originally 
> created for Dataset type coercion. They are quite strict and different from 
> the behavior of all existing popular DBMSs. This is a breaking change. It is 
> possible that existing queries will be broken after the 3.0 release.
> Following the ANSI SQL standard is the most proper solution, as the community 
> voted in the dev list. 
> For more details, see the discussion on 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562
>  and https://github.com/apache/spark/pull/25453 .
> This task is to add the ANSI store assignment policy as the default option of 
> the configuration "spark.sql.storeAssignmentPolicy".



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop when 
calling {{transferTo}}. What we saw is that when merging shuffle temp files, the 
task hangs for several hours until it is killed manually. In the log below you can 
see that there is no log output at all after spilling the shuffle data to disk, 
but the executor is still alive.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native method 
{{size0}}.

 !91ADA.png! 

We also used strace to trace the system calls and found that this thread is 
constantly calling {{fstat}}, with pretty high system usage; here is the 
screenshot. 

 !D18F4.png! 

We didn't find the root cause here; I guess it might be related to an FS or disk 
issue. Anyway, we should figure out a way to fail fast in such a scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spilling the shuffle data to 
disk, but the executor is still alive.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop when 
> calling {{transferTo}}. What we saw is that when merging shuffle temp files, the 
> task hangs for several hours until it is killed manually. In the log below you 
> can see that there is no log output at all after spilling the shuffle data to 
> disk, but the executor is still alive.
>  !95330.png! 
> And here is the thread dump; we can see that it keeps calling the native method 
> {{size0}}.
>  !91ADA.png! 
> We also used strace to trace the system calls and found that this thread is 
> constantly calling {{fstat}}, with pretty high system usage; here is the 
> screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here; I guess it might be related to an FS or disk 
> issue. Anyway, we should figure out a way to fail fast in such a scenario.
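
A hedged sketch of the "fail fast" idea described above (this is not Spark's actual code; the stall threshold is made up):

{code:scala}
import java.io.IOException
import java.nio.channels.{FileChannel, WritableByteChannel}

// Bound the number of transferTo calls that make no progress instead of looping forever.
def transferWithFailFast(in: FileChannel, out: WritableByteChannel,
                         startPosition: Long, bytesToCopy: Long): Unit = {
  var copied = 0L
  var stalled = 0
  val maxStalledAttempts = 1000 // made-up threshold, for illustration only
  while (copied < bytesToCopy) {
    val written = in.transferTo(startPosition + copied, bytesToCopy - copied, out)
    if (written > 0) {
      copied += written
      stalled = 0
    } else {
      stalled += 1
      if (stalled >= maxStalledAttempts) {
        throw new IOException(
          s"transferTo made no progress after $maxStalledAttempts attempts " +
          s"(copied $copied of $bytesToCopy bytes)")
      }
    }
  }
}
{code}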



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28797) Document DROP FUNCTION statement in SQL Reference.

2019-08-22 Thread Sandeep Katta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913219#comment-16913219
 ] 

Sandeep Katta commented on SPARK-28797:
---

PR is created: https://github.com/apache/spark/pull/25553
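
For reference, the kind of statements that page would document (the function names and the UDF class below are made up; this assumes a `spark` session):

{code:scala}
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.MyUpperUDF'") // hypothetical UDF class
spark.sql("DROP TEMPORARY FUNCTION IF EXISTS my_upper")
spark.sql("DROP FUNCTION IF EXISTS my_db.my_func")
{code}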

> Document DROP FUNCTION statement in SQL Reference.
> --
>
> Key: SPARK-28797
> URL: https://issues.apache.org/jira/browse/SPARK-28797
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-08-22 Thread jiangyu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913183#comment-16913183
 ] 

jiangyu edited comment on SPARK-28482 at 8/22/19 9:37 AM:
--

Hi [~bryanc], maybe you should produce more data, like 100,000 rows, and read 
10,000 rows every iteration. The number of rows is not right; it is smaller 
than expected.

I have investigated this issue this week. I found that the row count is correct 
when Arrow reads from the socket, so in serializers.py I revised the dump_stream 
method to redirect the stream to a local stream:
{code:python}
def dump_stream(self, iterator, stream):
    """
    Make ArrowRecordBatches from Pandas Series and serialize. Input is a single series or
    a list of series accompanied by an optional pyarrow type to coerce the data to.
    """
    import pyarrow as pa
    writer = None
    local_stream = pa.output_stream('/tmp/output')
    try:
        for series in iterator:
            batch = _create_batch(series, self._timezone)
            if writer is None:
                # write_int(SpecialLengths.START_ARROW_STREAM, stream)
                # writer = pa.RecordBatchStreamWriter(stream, batch.schema)
                write_int(SpecialLengths.START_ARROW_STREAM, local_stream)
                writer = pa.RecordBatchStreamWriter(local_stream, batch.schema)
            writer.write_batch(batch)
    finally:
        if writer is not None:
            writer.close()
{code}
 

The row count is correct, and no exception is thrown.

Then I changed daemon.py and increased the buffer size of outfile from the 
default 65536 to a larger value.
{code:python}
def worker(sock, authenticated):
    """
    Called by a worker process after the fork().
    """
    signal.signal(SIGHUP, SIG_DFL)
    signal.signal(SIGCHLD, SIG_DFL)
    signal.signal(SIGTERM, SIG_DFL)
    # restore the handler for SIGINT,
    # it's useful for debugging (show the stacktrace before exit)
    signal.signal(SIGINT, signal.default_int_handler)

    # Read the socket using fdopen instead of socket.makefile() because the latter
    # seems to be very slow; note that we need to dup() the file descriptor because
    # otherwise writes also cause a seek that makes us miss data on the read side.
    infile = os.fdopen(os.dup(sock.fileno()), "rb", 65536)
    outfile = os.fdopen(os.dup(sock.fileno()), "wb", 65536)
{code}
And everything is OK. I don't know if it is safe to increase the buffer size 
that high, but it really helps us.


was (Author: jiangyu1211):
hi, [~bryanc] , maybe you should produce more data, like 100,000 rows, and read 
10,000 rows every iteration. The number of the rows is not right, is smaller 
than expected.

I have investigate this issue this week,  i find the row numbers is correct 
when arrow read from the socket , so in  serializers.py , i revise the method 
of dump_stream,  change the stream to local stream
{code:java}
// code placeholder
def dump_stream(self, iterator, stream):
"""
Make ArrowRecordBatches from Pandas Series and serialize. Input is a single 
series or
a list of series accompanied by an optional pyarrow type to coerce the data 
to.
"""
import pyarrow as pa
writer = None
local_stream = pa.output_stream('/tmp/output')
try:
for series in iterator:
batch = _create_batch(series, self._timezone)
if writer is None:
# write_int(SpecialLengths.START_ARROW_STREAM, stream)
# writer = pa.RecordBatchStreamWriter(stream, batch.schema)
write_int(SpecialLengths.START_ARROW_STREAM, local_stream)
writer = pa.RecordBatchStreamWriter(local_stream, batch.schema)
writer.write_batch(batch)
finally:
if writer is not None:
writer.close()
{code}
 

The row numbers is correct, and no exception throw.

Then i  change the daemon.py , and increase the buffer size of outfile, from 
65536 to 65536.
{code:java}
// code placeholder
def worker(sock, authenticated):
"""
Called by a worker process after the fork().
"""
signal.signal(SIGHUP, SIG_DFL)
signal.signal(SIGCHLD, SIG_DFL)
signal.signal(SIGTERM, SIG_DFL)
# restore the handler for SIGINT,
# it's useful for debugging (show the stacktrace before exit)
signal.signal(SIGINT, signal.default_int_handler)

# Read the socket using fdopen instead of socket.makefile() because the 
latter
# seems to be very slow; note that we need to dup() the file descriptor 
because
# otherwise writes also cause a seek that makes us miss data on the read 
side.
infile = os.fdopen(os.dup(sock.fileno()), "rb", 65536)
outfile = os.fdopen(os.dup(sock.fileno()), "wb", 65536)
{code}
And everything is ok. So i don't know if it is safe to increase buffer size to 
this high. But it is really help us.

> 

[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spilling the shuffle data to 
disk, but the executor is still alive.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until it is killed manually. Here's 
the log you can see, there's no any log after spill the shuffle data to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until it is killed manually. Here's 
> the log you can see, there's no any log after spilling the shuffle data to 
> disk, but the executor is still alive.
>  !95330.png! 
> And here is the thread dump, we could see that it always calls native method 
> {{size0}}.
>  !91ADA.png! 
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, and the system usage is pretty high, here is the 
> screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28482) Data incomplete when using pandas udf in Python 3

2019-08-22 Thread jiangyu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913183#comment-16913183
 ] 

jiangyu commented on SPARK-28482:
-

hi, [~bryanc] , maybe you should produce more data, like 100,000 rows, and read 
10,000 rows every iteration. The number of the rows is not right, is smaller 
than expected.

I have investigate this issue this week,  i find the row numbers is correct 
when arrow read from the socket , so in  serializers.py , i revise the method 
of dump_stream,  change the stream to local stream
{code:java}
// code placeholder
def dump_stream(self, iterator, stream):
"""
Make ArrowRecordBatches from Pandas Series and serialize. Input is a single 
series or
a list of series accompanied by an optional pyarrow type to coerce the data 
to.
"""
import pyarrow as pa
writer = None
local_stream = pa.output_stream('/tmp/output')
try:
for series in iterator:
batch = _create_batch(series, self._timezone)
if writer is None:
# write_int(SpecialLengths.START_ARROW_STREAM, stream)
# writer = pa.RecordBatchStreamWriter(stream, batch.schema)
write_int(SpecialLengths.START_ARROW_STREAM, local_stream)
writer = pa.RecordBatchStreamWriter(local_stream, batch.schema)
writer.write_batch(batch)
finally:
if writer is not None:
writer.close()
{code}
 

The row numbers is correct, and no exception throw.

Then i  change the daemon.py , and increase the buffer size of outfile, from 
65536 to 65536.
{code:java}
// code placeholder
def worker(sock, authenticated):
"""
Called by a worker process after the fork().
"""
signal.signal(SIGHUP, SIG_DFL)
signal.signal(SIGCHLD, SIG_DFL)
signal.signal(SIGTERM, SIG_DFL)
# restore the handler for SIGINT,
# it's useful for debugging (show the stacktrace before exit)
signal.signal(SIGINT, signal.default_int_handler)

# Read the socket using fdopen instead of socket.makefile() because the 
latter
# seems to be very slow; note that we need to dup() the file descriptor 
because
# otherwise writes also cause a seek that makes us miss data on the read 
side.
infile = os.fdopen(os.dup(sock.fileno()), "rb", 65536)
outfile = os.fdopen(os.dup(sock.fileno()), "wb", 65536)
{code}
And everything is ok. So i don't know if it is safe to increase buffer size to 
this high. But it is really help us.

> Data incomplete when using pandas udf in Python 3
> -
>
> Key: SPARK-28482
> URL: https://issues.apache.org/jira/browse/SPARK-28482
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 2.4.3
> Environment: centos 7.4   
> pyarrow 0.10.0 0.14.0
> python 2.7 3.5 3.6
>Reporter: jiangyu
>Priority: Major
> Attachments: py2.7.png, py3.6.png, test.csv, test.py, worker.png
>
>
> Hi,
>   
>  Since Spark 2.3.x, pandas udf has been introduced as the default ser/des method 
> when using a udf. However, an issue arises with Python >= 3.5.x.
>  We use pandas udf to process batches of data, but we find the data is 
> incomplete in Python 3.x. At first, I thought the processing logic might be wrong, 
> so I changed the code to a very simple one and it had the same problem. After 
> investigating for a week, I found it is related to pyarrow.   
>   
>  *Reproduce procedure:*
> 1. prepare data
>  The data have seven column, a、b、c、d、e、f and g, data type is Integer
>  a,b,c,d,e,f,g
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>  1,2,3,4,5,6,7
>   produce 100,000 rows and name the file test.csv ,upload to hdfs, then load 
> it , and repartition it to 1 partition.
>   
> {code:java}
> df=spark.read.format('csv').option("header","true").load('/test.csv')
> df=df.select(*(col(c).cast("int").alias(c) for c in df.columns))
> df=df.repartition(1)
> spark_context = SparkContext.getOrCreate() {code}
>  
>  2.register pandas udf
>   
> {code:java}
> def add_func(a,b,c,d,e,f,g):
> print('iterator one time')
> return a
> add = pandas_udf(add_func, returnType=IntegerType())
> df_result=df.select(add(col("a"),col("b"),col("c"),col("d"),col("e"),col("f"),col("g"))){code}
>  
>  3.apply pandas udf
>   
> {code:java}
> def trigger_func(iterator):
>       yield iterator
> df_result.rdd.foreachPartition(trigger_func){code}
>  
>  4.execute it in pyspark (local or yarn)
>  run it with conf spark.sql.execution.arrow.maxRecordsPerBatch=10. As 
> mentioned before the total row number is 100, it should print "iterator 
> one time " 10 times.
>  (1)Python 2.7 envs:
>   
> {code:java}
> PYSPARK_PYTHON=/usr/lib/conda/envs/py2.7/bin/python pyspark --conf 
> spark.sql.execution.arrow.maxRecordsPerBatch=10 --conf 
> spark

[jira] [Updated] (SPARK-28495) Introduce ANSI store assignment policy for table insertion

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28495:
---
Summary: Introduce ANSI store assignment policy for table insertion  (was: 
Follow ANSI SQL on table insertion)

> Introduce ANSI store assignment policy for table insertion
> --
>
> Key: SPARK-28495
> URL: https://issues.apache.org/jira/browse/SPARK-28495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In Spark version 2.4 and earlier, when inserting into a table, Spark will 
> cast the data type of the input query to the data type of the target table by 
> coercion. This can be super confusing, e.g. users make a mistake and write 
> string values to an int column.
> In data source V2, by default, only upcasting is allowed when inserting data 
> into a table. E.g. int -> long and int -> string are allowed, while decimal 
> -> double or long -> int are not allowed. The rules of UpCast were originally 
> created for Dataset type coercion. They are quite strict and different from 
> the behavior of all existing popular DBMSs. This is a breaking change. It is 
> possible that it would hurt some Spark users after the 3.0 release.
> This PR proposes that we can follow the rules of store assignment (section 
> 9.2) in ANSI SQL. Two significant differences from Up-Cast:
> 1. Any numeric type can be assigned to another numeric type.
> 2. TimestampType can be assigned to DateType.
> The new behavior is consistent with PostgreSQL. It is more explainable and 
> acceptable than using UpCast.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2019-08-22 Thread Nikita Gorbachevski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913181#comment-16913181
 ] 

Nikita Gorbachevski edited comment on SPARK-22876 at 8/22/19 9:22 AM:
--

Hi [~praveentallapudi], these options still work in cases where YARN kills the 
driver forcefully and the shutdown hook is not invoked, e.g. on OOM or a node 
manager failure. For other cases I implemented the same feature programmatically, 
by running the SparkContext in a separate thread and stopping/starting it on 
non-fatal exceptions from the main thread.
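
A hedged sketch of that workaround (greatly simplified; the separate-thread detail is omitted, `shouldRestart` is a placeholder policy, and real code would add logging and backoff):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.control.NonFatal

def runJob(sc: SparkContext): Unit = {
  // application logic goes here
}

def shouldRestart(failures: Int): Boolean = failures < 20 // placeholder restart policy

var failures = 0
var keepRunning = true
while (keepRunning) {
  val sc = new SparkContext(new SparkConf().setAppName("long-running-app"))
  try {
    runJob(sc)
    keepRunning = false // finished normally
  } catch {
    case NonFatal(e) =>
      failures += 1
      keepRunning = shouldRestart(failures) // a fresh SparkContext is built on the next loop
  } finally {
    sc.stop()
  }
}
{code}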


was (Author: choojoyq):
Hi [~praveentallapudi], these options still work in cases when yarn kills 
driver forcefully and shutdown hook is not invoked, e.g. OOM or node manager 
failure. For other cases i implemented the same feature programmatically via 
running SparkContext is separated thread and stop/start it on non fatal 
exceptions from main thread.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to make a long running 
> application avoid stopping  after acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times ( n is minimum value of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from client yarn-site.xml)
> for example, following setup will allow the application master to fail 20 
> times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found in source file 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there's a ShutdownHook that checks the attempt id against the maxAppAttempts, 
> if attempt id >= maxAppAttempts, it will try to unregister the application 
> and the application will finish.
> is this an expected design or a bug?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2019-08-22 Thread Nikita Gorbachevski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913181#comment-16913181
 ] 

Nikita Gorbachevski commented on SPARK-22876:
-

Hi [~praveentallapudi], these options still work in cases when yarn kills 
driver forcefully and shutdown hook is not invoked, e.g. OOM or node manager 
failure. For other cases i implemented the same feature programmatically via 
running SparkContext is separated thread and stop/start it on non fatal 
exceptions from main thread.

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to make a long running 
> application avoid stopping  after acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times ( n is minimum value of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from client yarn-site.xml)
> for example, following setup will allow the application master to fail 20 
> times.
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found in source file 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> there's a ShutdownHook that checks the attempt id against the maxAppAttempts, 
> if attempt id >= maxAppAttempts, it will try to unregister the application 
> and the application will finish.
> is this an expected design or a bug?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28730) Configurable type coercion policy for table insertion

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28730:
---
Description: 
After all the discussions in the dev list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
 
Here I propose that we can make the store assignment rules in the analyzer 
configurable, and the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will 
perform type coercion. After this PR, we support 2 policies for the type 
coercion rules: 
legacy and strict. 
1. With legacy policy, Spark allows casting any value to any data type and null 
result is returned when the conversion is invalid. The legacy policy is the 
only behavior in Spark 2.x and it is compatible with Hive. 
2. With strict policy, Spark doesn't allow any possible precision loss or data 
truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not 
allowed.

Eventually, the "legacy" mode will be removed, so it is disallowed in data 
source V2.
To ensure backward compatibility with existing queries, the default store 
assignment policy for data source V1 is "legacy" before ANSI mode is 
implemented.

  was:
After all the discussions in the dev list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
 
Here I propose that we can make the store assignment rules in the analyzer 
configurable, and the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will 
perform type coercion. After this PR, we support 2 policies for the type 
coercion rules: 
legacy and strict. 
1. With legacy policy, Spark allows casting any value to any data type and null 
result is returned when the conversion is invalid. The legacy policy is the 
only behavior in Spark 2.x and it is compatible with Hive. 
2. With strict policy, Spark doesn't allow any possible precision loss or data 
truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not 
allowed.

To ensure backward compatibility with existing queries, the default store 
assignment policy is "legacy".


> Configurable type coercion policy for table insertion
> -
>
> Key: SPARK-28730
> URL: https://issues.apache.org/jira/browse/SPARK-28730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> After all the discussions in the dev list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
>  
> Here I propose that we can make the store assignment rules in the analyzer 
> configurable, and the behavior of V1 and V2 should be consistent.
> When inserting a value into a column with a different data type, Spark will 
> perform type coercion. After this PR, we support 2 policies for the type 
> coercion rules: 
> legacy and strict. 
> 1. With legacy policy, Spark allows casting any value to any data type and 
> null result is returned when the conversion is invalid. The legacy policy is 
> the only behavior in Spark 2.x and it is compatible with Hive. 
> 2. With strict policy, Spark doesn't allow any possible precision loss or 
> data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` 
> are not allowed.
> Eventually, the "legacy" mode will be removed, so it is disallowed in data 
> source V2.
> To ensure backward compatibility with existing queries, the default store 
> assignment policy for data source V1 is "legacy" before ANSI mode is 
> implemented.
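
To make the two policies concrete, a small sketch of the intended behaviour (the configuration key name spark.sql.storeAssignmentPolicy is an assumption here, not something fixed by this ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("store-assignment-sketch").master("local[1]").getOrCreate()
spark.sql("CREATE TABLE t (i INT) USING parquet")

// Hypothetical key for the proposed policy switch.
spark.conf.set("spark.sql.storeAssignmentPolicy", "legacy")
spark.sql("INSERT INTO t VALUES ('not a number')")   // legacy: the invalid cast stores NULL
spark.sql("SELECT * FROM t").show()

spark.conf.set("spark.sql.storeAssignmentPolicy", "strict")
// strict: rejected at analysis time, since STRING -> INT may lose data
// spark.sql("INSERT INTO t VALUES ('1')")
{code}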



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28851) Connect HBase using Spark SQL in Spark 2.x

2019-08-22 Thread ARUN KINDRA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ARUN KINDRA updated SPARK-28851:

Description: 
Hi,

 

I am basically trying a sample Spark SQL job that reads data from Oracle and 
stores it in HBase. I found a Spark-HBase connector that writes data into 
HBase and requires a catalog to be provided, but it seems it was only 
available up to Spark 1.6. What is the way to connect to HBase using Spark 
SqlContext in Spark 2.x?

 

  was:
Hi,

 

I am basically trying a sample Spark SQL job that reads data from Oracle and 
stores it in HBase. I found a Spark-HBase connector that requires a catalog to 
be provided, but it seems it was only available up to Spark 1.6. What is the 
way to connect to HBase using Spark SqlContext?

 


> Connect HBase using Spark SQL in Spark 2.x
> --
>
> Key: SPARK-28851
> URL: https://issues.apache.org/jira/browse/SPARK-28851
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: ARUN KINDRA
>Priority: Major
>
> Hi,
>  
> I am basically trying a sample Spark SQL job that reads data from Oracle and 
> stores it in HBase. I found a Spark-HBase connector that writes data into 
> HBase and requires a catalog to be provided, but it seems it was only 
> available up to Spark 1.6. What is the way to connect to HBase using Spark 
> SqlContext in Spark 2.x?
>  
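
A heavily hedged sketch of one common Spark 2.x path: go through a DataSource connector rather than SqlContext directly. The sketch below assumes the Hortonworks spark-hbase connector (shc) is on the classpath; the format string, option keys, catalog JSON, and the JDBC source details are assumptions taken from that project's documentation, not a Spark API.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-write-sketch").getOrCreate()

// shc-style catalog describing the target HBase table and column mapping (an assumption).
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "person"},
     |  "rowkey": "id",
     |  "columns": {
     |    "id":   {"cf": "rowkey", "col": "id",   "type": "string"},
     |    "name": {"cf": "info",   "col": "name", "type": "string"}
     |  }
     |}""".stripMargin

// Read from Oracle over JDBC (URL and table name are illustrative).
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//host:1521/service")
  .option("dbtable", "PERSON")
  .load()

// Write into HBase through the shc data source (option keys per the shc README).
df.write
  .option("catalog", catalog)
  .option("newtable", "5")   // number of regions if the table has to be created
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
{code}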



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28730) Configurable type coercion policy for table insertion

2019-08-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-28730:
---
Parent: SPARK-28589
Issue Type: Sub-task  (was: Improvement)

> Configurable type coercion policy for table insertion
> -
>
> Key: SPARK-28730
> URL: https://issues.apache.org/jira/browse/SPARK-28730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> After all the discussions in the dev list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
>  
> Here I propose that we can make the store assignment rules in the analyzer 
> configurable, and the behavior of V1 and V2 should be consistent.
> When inserting a value into a column with a different data type, Spark will 
> perform type coercion. After this PR, we support 2 policies for the type 
> coercion rules: 
> legacy and strict. 
> 1. With legacy policy, Spark allows casting any value to any data type and 
> null result is returned when the conversion is invalid. The legacy policy is 
> the only behavior in Spark 2.x and it is compatible with Hive. 
> 2. With strict policy, Spark doesn't allow any possible precision loss or 
> data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` 
> are not allowed.
> To ensure backward compatibility with existing queries, the default store 
> assignment policy is "legacy".



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28851) Connect HBase using Spark SQL in Spark 2.x

2019-08-22 Thread ARUN KINDRA (Jira)
ARUN KINDRA created SPARK-28851:
---

 Summary: Connect HBase using Spark SQL in Spark 2.x
 Key: SPARK-28851
 URL: https://issues.apache.org/jira/browse/SPARK-28851
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
Reporter: ARUN KINDRA


Hi,

 

I am basically trying a sample Spark SQL job that reads data from Oracle and 
stores it in HBase. I found a Spark-HBase connector that requires a catalog to 
be provided, but it seems it was only available up to Spark 1.6. What is the 
way to connect to HBase using Spark SqlContext?

 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28850) Binary Files RDD allocates false number of threads

2019-08-22 Thread Marco Lotz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Lotz updated SPARK-28850:
---
Description: 
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is [available 
here|https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala].

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of threads to be used (in the case of multi-threaded 
reading) to the number of cores (including Hyper-Threading ones) available on 
the driver host machine.

This number is wrong, since what really matters is the number of cores 
allocated to the driver container by YARN, not the number of cores available 
on the host machine. This can easily impact the Spark UI and the driver 
application performance, since the number of threads is far bigger than the 
actual number of allocated cores, which increases the number of unnecessary 
preemptions and context switches.

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.

  was:
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is [available 
here|[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala]]
 :

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of Threads to be used (in the case of multi-threading 
reading) to the number of cores (including Hyper Threading ones) available one 
the driver host machine.

This number is false, since what really matters is the number of cores 
allocated to the driver container by YARN and not the number of cores available 
in the host machine. This can easily impact the Spark-UI and the driver 
application performance, since the number of threads is far bigger than the 
true amount of allocated cores - which increases the number of unrequired 
preemptions and context switches

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.


> Binary Files RDD allocates false number of threads
> --
>
> Key: SPARK-28850
> URL: https://issues.apache.org/jira/browse/SPARK-28850
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
>Reporter: Marco Lotz
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When making a call to:
> {code:java}
> sc.binaryFiles(somePath){code}
>  
> It creates a BinaryFileRDD. Some sections of that code are run inside the 
> driver container. The current source code for BinaryFileRDD is [available 
> here|[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala]]:
> The problematic line is:
>  
> {code:java}
> conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
> Runtime.getRuntime.availableProcessors().toString)
> {code}
>  
> This line sets the number of Threads to be used (in the case of 
> multi-threading reading) to the number of cores (including Hyper Threading 
> ones) available one the driver host machine.
> This number is false, since what really matters is the number of cores 
> allocated to the driver container by YARN and not the number of cores 
> available in the host machine. This can easily impact the Spark-UI and the 
> driver application performance, since the number of threads is far bigger 
> than the true amount of allocated cores - which increases the number of 
> unrequired preemptions and context switches
> The solution is to retrieve the number of cores allocated to the Application 
> Master by YARN instead.
> Once confirmed the problem, I can work on retrieving that information and 
> making a PR.
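
Until the default is fixed, a workaround sketch is to pin the listing parallelism explicitly before calling binaryFiles, since BinaryFileRDD only applies its availableProcessors() default when the key is unset (the path and thread count below are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("binary-files-sketch").getOrCreate()
val sc = spark.sparkContext

// Explicitly cap the listing threads; setIfUnset in BinaryFileRDD then leaves it alone.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.list-status.num-threads", // FileInputFormat.LIST_STATUS_NUM_THREADS
  "4")                                                       // match the cores granted to the driver container

val files = sc.binaryFiles("hdfs:///some/path")  // hypothetical input path
println(files.count())
{code}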



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28850) Binary Files RDD allocates false number of threads

2019-08-22 Thread Marco Lotz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Lotz updated SPARK-28850:
---
Description: 
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is [available 
here|https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala].

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of threads to be used (in the case of multi-threaded 
reading) to the number of cores (including Hyper-Threading ones) available on 
the driver host machine.

This number is wrong, since what really matters is the number of cores 
allocated to the driver container by YARN, not the number of cores available 
on the host machine. This can easily impact the Spark UI and the driver 
application performance, since the number of threads is far bigger than the 
actual number of allocated cores, which increases the number of unnecessary 
preemptions and context switches.

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.

  was:
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is available here:
 
[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala

]

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of Threads to be used (in the case of multi-threading 
reading) to the number of cores (including Hyper Threading ones) available one 
the driver host machine.

This number is false, since what really matters is the number of cores 
allocated to the driver container by YARN and not the number of cores available 
in the host machine. This can easily impact the Spark-UI and the driver 
application performance, since the number of threads is far bigger than the 
true amount of allocated cores - which increases the number of unrequired 
preemptions and context switches

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.


> Binary Files RDD allocates false number of threads
> --
>
> Key: SPARK-28850
> URL: https://issues.apache.org/jira/browse/SPARK-28850
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
>Reporter: Marco Lotz
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When making a call to:
> {code:java}
> sc.binaryFiles(somePath){code}
>  
> It creates a BinaryFileRDD. Some sections of that code are run inside the 
> driver container. The current source code for BinaryFileRDD is [available 
> here|[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala]:
> The problematic line is:
>  
> {code:java}
> conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
> Runtime.getRuntime.availableProcessors().toString)
> {code}
>  
> This line sets the number of Threads to be used (in the case of 
> multi-threading reading) to the number of cores (including Hyper Threading 
> ones) available one the driver host machine.
> This number is false, since what really matters is the number of cores 
> allocated to the driver container by YARN and not the number of cores 
> available in the host machine. This can easily impact the Spark-UI and the 
> driver application performance, since the number of threads is far bigger 
> than the true amount of allocated cores - which increases the number of 
> unrequired preemptions and context switches
> The solution is to retrieve the number of cores allocated to the Application 
> Master by YARN instead.
> Once confirmed the problem, I can work on retrieving that information and 
> making a PR.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28850) Binary Files RDD allocates false number of threads

2019-08-22 Thread Marco Lotz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Lotz updated SPARK-28850:
---
Description: 
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is [available 
here|https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala].

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of threads to be used (in the case of multi-threaded 
reading) to the number of cores (including Hyper-Threading ones) available on 
the driver host machine.

This number is wrong, since what really matters is the number of cores 
allocated to the driver container by YARN, not the number of cores available 
on the host machine. This can easily impact the Spark UI and the driver 
application performance, since the number of threads is far bigger than the 
actual number of allocated cores, which increases the number of unnecessary 
preemptions and context switches.

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.

  was:
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is [available 
here|[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala]:

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of Threads to be used (in the case of multi-threading 
reading) to the number of cores (including Hyper Threading ones) available one 
the driver host machine.

This number is false, since what really matters is the number of cores 
allocated to the driver container by YARN and not the number of cores available 
in the host machine. This can easily impact the Spark-UI and the driver 
application performance, since the number of threads is far bigger than the 
true amount of allocated cores - which increases the number of unrequired 
preemptions and context switches

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.


> Binary Files RDD allocates false number of threads
> --
>
> Key: SPARK-28850
> URL: https://issues.apache.org/jira/browse/SPARK-28850
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
>Reporter: Marco Lotz
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When making a call to:
> {code:java}
> sc.binaryFiles(somePath){code}
>  
> It creates a BinaryFileRDD. Some sections of that code are run inside the 
> driver container. The current source code for BinaryFileRDD is [available 
> here|[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala]]
>  :
> The problematic line is:
>  
> {code:java}
> conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
> Runtime.getRuntime.availableProcessors().toString)
> {code}
>  
> This line sets the number of Threads to be used (in the case of 
> multi-threading reading) to the number of cores (including Hyper Threading 
> ones) available one the driver host machine.
> This number is false, since what really matters is the number of cores 
> allocated to the driver container by YARN and not the number of cores 
> available in the host machine. This can easily impact the Spark-UI and the 
> driver application performance, since the number of threads is far bigger 
> than the true amount of allocated cores - which increases the number of 
> unrequired preemptions and context switches
> The solution is to retrieve the number of cores allocated to the Application 
> Master by YARN instead.
> Once confirmed the problem, I can work on retrieving that information and 
> making a PR.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28850) Binary Files RDD allocates false number of threads

2019-08-22 Thread Marco Lotz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Lotz updated SPARK-28850:
---
Description: 
When making a call to:
{code:java}
sc.binaryFiles(somePath){code}
 

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is available here:
https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala

The problematic line is:

 
{code:java}
conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)
{code}
 

This line sets the number of threads to be used (in the case of multi-threaded 
reading) to the number of cores (including Hyper-Threading ones) available on 
the driver host machine.

This number is wrong, since what really matters is the number of cores 
allocated to the driver container by YARN, not the number of cores available 
on the host machine. This can easily impact the Spark UI and the driver 
application performance, since the number of threads is far bigger than the 
actual number of allocated cores, which increases the number of unnecessary 
preemptions and context switches.

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.

  was:
When making a call to:

```scala

sc.binaryFiles(somePath)

```

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is available here:
[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala

]The problematic line is:

```scala

conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)

```

This line sets the number of Threads to be used (in the case of multi-threading 
reading) to the number of cores (including Hyper Threading ones) available one 
the driver host machine.


This number is false, since what really matters is the number of cores 
allocated to the driver container by YARN and not the number of cores available 
in the host machine. This can easily impact the Spark-UI and the driver 
application performance, since the number of threads is far bigger than the 
true amount of allocated cores - which increases the number of unrequired 
preemptions and context switches

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.


> Binary Files RDD allocates false number of threads
> --
>
> Key: SPARK-28850
> URL: https://issues.apache.org/jira/browse/SPARK-28850
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3
>Reporter: Marco Lotz
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When making a call to:
> {code:java}
> sc.binaryFiles(somePath){code}
>  
> It creates a BinaryFileRDD. Some sections of that code are run inside the 
> driver container. The current source code for BinaryFileRDD is available here:
>  
> [https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala
> ]
> The problematic line is:
>  
> {code:java}
> conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
> Runtime.getRuntime.availableProcessors().toString)
> {code}
>  
> This line sets the number of Threads to be used (in the case of 
> multi-threading reading) to the number of cores (including Hyper Threading 
> ones) available one the driver host machine.
> This number is false, since what really matters is the number of cores 
> allocated to the driver container by YARN and not the number of cores 
> available in the host machine. This can easily impact the Spark-UI and the 
> driver application performance, since the number of threads is far bigger 
> than the true amount of allocated cores - which increases the number of 
> unrequired preemptions and context switches
> The solution is to retrieve the number of cores allocated to the Application 
> Master by YARN instead.
> Once confirmed the problem, I can work on retrieving that information and 
> making a PR.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28850) Binary Files RDD allocates false number of threads

2019-08-22 Thread Marco Lotz (Jira)
Marco Lotz created SPARK-28850:
--

 Summary: Binary Files RDD allocates false number of threads
 Key: SPARK-28850
 URL: https://issues.apache.org/jira/browse/SPARK-28850
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.3
Reporter: Marco Lotz


When making a call to:

```scala

sc.binaryFiles(somePath)

```

It creates a BinaryFileRDD. Some sections of that code are run inside the 
driver container. The current source code for BinaryFileRDD is available here:
https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/rdd/BinaryFileRDD.scala

The problematic line is:

```scala

conf.setIfUnset(FileInputFormat.LIST_STATUS_NUM_THREADS, 
Runtime.getRuntime.availableProcessors().toString)

```

This line sets the number of threads to be used (in the case of multi-threaded 
reading) to the number of cores (including Hyper-Threading ones) available on 
the driver host machine.

This number is wrong, since what really matters is the number of cores 
allocated to the driver container by YARN, not the number of cores available 
on the host machine. This can easily impact the Spark UI and the driver 
application performance, since the number of threads is far bigger than the 
actual number of allocated cores, which increases the number of unnecessary 
preemptions and context switches.

The solution is to retrieve the number of cores allocated to the Application 
Master by YARN instead.

Once confirmed the problem, I can work on retrieving that information and 
making a PR.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28779) CSV writer doesn't handle older Mac line endings

2019-08-22 Thread nicolas paris (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913151#comment-16913151
 ] 

nicolas paris commented on SPARK-28779:
---

good to know thanks

> CSV writer doesn't handle older Mac line endings
> 
>
> Key: SPARK-28779
> URL: https://issues.apache.org/jira/browse/SPARK-28779
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
>Reporter: nicolas paris
>Priority: Minor
>
> The Spark CSV writer does not consider "\r" a newline in string-type 
> columns. As a result, the affected values are not quoted and the resulting 
> CSV gets corrupted.
> All of \n, \r\n and \r should be treated as newlines to allow robust CSV 
> serialization.
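
A minimal reproduction sketch of the quoting gap described above (column values and output path are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-cr-sketch").master("local[1]").getOrCreate()
import spark.implicits._

// "\n" (and "\r\n") inside a value forces quoting, but a bare "\r" (old Mac newline)
// does not, so the written line is split in two when the file is read back.
val df = Seq(("a", "line1\rline2"), ("b", "line1\nline2")).toDF("id", "text")
df.write.mode("overwrite").csv("/tmp/csv-cr-sketch")

spark.read.csv("/tmp/csv-cr-sketch").show(truncate = false)
{code}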



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28832) Document SHOW SCHEMAS statement in SQL Reference.

2019-08-22 Thread jobit mathew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16913149#comment-16913149
 ] 

jobit mathew commented on SPARK-28832:
--

[~dkbiswal], but the commands are different, right, even though the outputs are 
the same? And I am not sure that whoever handles the SHOW DATABASES 
documentation JIRA will also mention SHOW SCHEMAS. Alternatively, you could 
cover SCHEMAS in that same JIRA, so I can close this one.
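
For the eventual documentation, a quick sketch showing that the two statements return the same listing (the LIKE pattern form is an assumption based on the usual SHOW DATABASES syntax):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("show-schemas-sketch").master("local[1]").getOrCreate()

// Per the discussion above, both statements list the same namespaces.
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW SCHEMAS").show()
spark.sql("SHOW SCHEMAS LIKE 'def*'").show()
{code}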

> Document SHOW SCHEMAS statement in SQL Reference.
> -
>
> Key: SPARK-28832
> URL: https://issues.apache.org/jira/browse/SPARK-28832
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: jobit mathew
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28836) Introduce TPCDSSchema

2019-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28836.
--
Resolution: Won't Fix

See {{TPCDSQueryBenchmark}}. We already have ways in the Spark code base for 
this testing purpose.

> Introduce TPCDSSchema
> -
>
> Key: SPARK-28836
> URL: https://issues.apache.org/jira/browse/SPARK-28836
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Priority: Minor
>
> This PR extracts the schema information of TPCDS tables into a separate class 
> called `TPCDSSchema` which can be reused for other testing purposes



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28844) Fix typo in SQLConf FILE_COMRESSION_FACTOR

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28844.
---
Fix Version/s: 2.4.4
   2.3.4
   3.0.0
   Resolution: Fixed

Issue resolved by pull request 25538
[https://github.com/apache/spark/pull/25538]

> Fix typo in SQLConf FILE_COMRESSION_FACTOR
> --
>
> Key: SPARK-28844
> URL: https://issues.apache.org/jira/browse/SPARK-28844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3, 3.0.0, 2.4.3
>Reporter: ZhangYao
>Priority: Major
> Fix For: 3.0.0, 2.3.4, 2.4.4
>
>
> Fix the typo in SQLConf FILE_COMRESSION_FACTOR and change it to 
> FILE_COMPRESSION_FACTOR



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26895) When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to resolve globs owned by target user

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26895:
--
Fix Version/s: 2.3.4

> When running spark 2.3 as a proxy user (--proxy-user), SparkSubmit fails to 
> resolve globs owned by target user
> --
>
> Key: SPARK-26895
> URL: https://issues.apache.org/jira/browse/SPARK-26895
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Alessandro Bellina
>Assignee: Alessandro Bellina
>Priority: Critical
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> We are resolving globs in SparkSubmit here (by way of 
> prepareSubmitEnvironment) without first going into a doAs:
> https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143
> Without first entering a doAs, as done here:
> [https://github.com/apache/spark/blob/6c18d8d8079ac4d2d6dc7539601ab83fc5b51760/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L151]
> So when running spark-submit with --proxy-user, and for example --archives, 
> it will fail to launch unless the location of the archive is open to the user 
> that executed spark-submit.
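
For context, a simplified sketch of the doAs pattern involved (illustrative only, not the actual SparkSubmit code; the helper name and the glob call in the comment are hypothetical):

{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Resolve paths as the proxy user so globs under that user's directories are readable.
def resolveAsProxyUser[T](proxyUser: String)(body: => T): T = {
  val ugi = UserGroupInformation.createProxyUser(
    proxyUser, UserGroupInformation.getCurrentUser)
  ugi.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body
  })
}

// Usage sketch: any Hadoop FileSystem globbing done inside the block runs as `alice`.
val resolved = resolveAsProxyUser("alice") {
  // e.g. FileSystem.get(conf).globStatus(new Path("/user/alice/archives/*.zip"))
  Seq("/user/alice/archives/deps.zip")  // placeholder result
}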



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28847) Annotate HiveExternalCatalogVersionsSuite with ExtendedHiveTest

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28847:
-

Assignee: Dongjoon Hyun

> Annotate HiveExternalCatalogVersionsSuite with ExtendedHiveTest
> ---
>
> Key: SPARK-28847
> URL: https://issues.apache.org/jira/browse/SPARK-28847
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> `HiveExternalCatalogVersionsSuite` stands out for how long it takes to run.
> !https://user-images.githubusercontent.com/9700541/63489184-4c75af00-c466-11e9-9e12-d250d4a23292.png!
> This issue aims to annotate `HiveExternalCatalogVersionsSuite` with 
> `ExtendedHiveTest` so that the suite can be skipped when that tag is excluded.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28780) Delete the incorrect setWeightCol method in LinearSVCModel

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28780:
--
Fix Version/s: 2.3.4

> Delete the incorrect setWeightCol method in LinearSVCModel
> --
>
> Key: SPARK-28780
> URL: https://issues.apache.org/jira/browse/SPARK-28780
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>  Labels: release-notes
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> 1. weightCol is only used in training and should not be settable on 
> LinearSVCModel;
> 2. the method 'def setWeightCol(value: Double): this.type = set(threshold, 
> value)' is wrongly defined, since the value should be a String and it should 
> set weightCol rather than threshold.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28780) Delete the incorrect setWeightCol method in LinearSVCModel

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28780:
--
Fix Version/s: 2.4.4

> Delete the incorrect setWeightCol method in LinearSVCModel
> --
>
> Key: SPARK-28780
> URL: https://issues.apache.org/jira/browse/SPARK-28780
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>  Labels: release-notes
> Fix For: 2.4.4, 3.0.0
>
>
> 1. weightCol is only used in training and should not be settable on 
> LinearSVCModel;
> 2. the method 'def setWeightCol(value: Double): this.type = set(threshold, 
> value)' is wrongly defined, since the value should be a String and it should 
> set weightCol rather than threshold.
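
For clarity, a sketch of where the correctly typed setter lives: on the estimator, where the weight column is actually consumed during fit(). The model-side Double-typed setter is simply removed by this fix rather than corrected.

{code:scala}
import org.apache.spark.ml.classification.LinearSVC

// weightCol belongs on the estimator; the String-typed setter is the supported path.
val svc = new LinearSVC()
  .setWeightCol("weight")   // per-row instance weights column
  .setMaxIter(10)

// The removed LinearSVCModel.setWeightCol(value: Double) silently set `threshold`;
// model thresholds should be set via setThreshold instead.
{code}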



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28847) Annotate HiveExternalCatalogVersionsSuite with ExtendedHiveTest

2019-08-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28847.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25550
[https://github.com/apache/spark/pull/25550]

> Annotate HiveExternalCatalogVersionsSuite with ExtendedHiveTest
> ---
>
> Key: SPARK-28847
> URL: https://issues.apache.org/jira/browse/SPARK-28847
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> `HiveExternalCatalogVersionsSuite` stands out for how long it takes to run.
> !https://user-images.githubusercontent.com/9700541/63489184-4c75af00-c466-11e9-9e12-d250d4a23292.png!
> This issue aims to annotate `HiveExternalCatalogVersionsSuite` with 
> `ExtendedHiveTest` so that the suite can be skipped when that tag is excluded.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop when 
calling {{transferTo}}. What we saw is that when merging the shuffle temp 
files, the task hangs for several hours until it is killed manually. In the log 
below you can see that there is no log output after spilling the shuffle data 
to disk.

 !95330.png! 

And here is the thread dump; we can see that it keeps calling the native method 
{{size0}}.

 !91ADA.png! 

We also used strace to trace the system and found that this thread is constantly 
calling {{fstat}} while the system usage is pretty high; here is the screenshot.

 !D18F4.png! 

We didn't find the root cause here; I guess it might be related to an FS or disk 
issue. Anyway, we should figure out a way to fail fast in such a scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until it is killed manually. Here's 
> the log you can see, there's no any log after spill the shuffle data to disk.
>  !95330.png! 
> And here is the thread dump, we could see that it always calls native method 
> {{size0}}.
>  !91ADA.png! 
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, and the system usage is pretty high, here is the 
> screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.
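
As a rough illustration of the fail-fast idea (a sketch only, not UnsafeShuffleWriter's actual merge code; the helper and its stall threshold are hypothetical):

{code:scala}
import java.nio.channels.FileChannel

// Copy `count` bytes from `src` (starting at `position`) into `dst`, but give up
// once transferTo stops making progress instead of spinning forever.
def transferWithFailFast(src: FileChannel, dst: FileChannel,
                         position: Long, count: Long,
                         maxStalledAttempts: Int = 64): Long = {
  var transferred = 0L
  var stalled = 0
  while (transferred < count) {
    val n = src.transferTo(position + transferred, count - transferred, dst)
    if (n > 0) {
      transferred += n
      stalled = 0
    } else {
      stalled += 1
      if (stalled >= maxStalledAttempts) {
        throw new java.io.IOException(
          s"transferTo made no progress after $stalled attempts " +
          s"($transferred of $count bytes copied); failing fast")
      }
    }
  }
  transferred
}
{code}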



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28821) Document COMPUTE STAT in SQL Reference

2019-08-22 Thread ABHISHEK KUMAR GUPTA (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ABHISHEK KUMAR GUPTA resolved SPARK-28821.
--
Resolution: Duplicate

Will be covered in the ANALYZE TABLE documentation JIRA.

> Document COMPUTE STAT in SQL Reference
> --
>
> Key: SPARK-28821
> URL: https://issues.apache.org/jira/browse/SPARK-28821
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28848) insert overwrite local directory stored as parquet does not creates snappy.parquet data file at local directory path

2019-08-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-28848:
-
Description: 
{code}
0: jdbc:hive2://10.18.18.214:23040/func> insert overwrite local directory 
'/opt/trash4/' stored as parquet select * from trash1 a where a.country='PAK';
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (1.368 seconds)
{code}

Data file at local directory path:

{code}
vm1:/opt/trash4 # ll
total 12
-rw-r--r-- 1 root root   8 Aug 22 14:30 ._SUCCESS.crc
-rw-r--r-- 1 root root  16 Aug 22 14:30 
.part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000.crc
-rw-r--r-- 1 root root   0 Aug 22 14:30 _SUCCESS
-rw-r--r-- 1 root root 619 Aug 22 14:30 
part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000
{code}

  was:

0: jdbc:hive2://10.18.18.214:23040/func> insert overwrite local directory 
'/opt/trash4/' stored as parquet select * from trash1 a where a.country='PAK';
+-+--+
| Result  |
+-+--+
+-+--+
No rows selected (1.368 seconds)
Data file at local directory path:
vm1:/opt/trash4 # ll
total 12
-rw-r--r-- 1 root root   8 Aug 22 14:30 ._SUCCESS.crc
-rw-r--r-- 1 root root  16 Aug 22 14:30 
.part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000.crc
-rw-r--r-- 1 root root   0 Aug 22 14:30 _SUCCESS
-rw-r--r-- 1 root root 619 Aug 22 14:30 
part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000



> insert overwrite local directory  stored as parquet does not creates 
> snappy.parquet data file at local directory path
> ---
>
> Key: SPARK-28848
> URL: https://issues.apache.org/jira/browse/SPARK-28848
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/func> insert overwrite local directory 
> '/opt/trash4/' stored as parquet select * from trash1 a where a.country='PAK';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (1.368 seconds)
> {code}
> Data file at local directory path:
> {code}
> vm1:/opt/trash4 # ll
> total 12
> -rw-r--r-- 1 root root   8 Aug 22 14:30 ._SUCCESS.crc
> -rw-r--r-- 1 root root  16 Aug 22 14:30 
> .part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000.crc
> -rw-r--r-- 1 root root   0 Aug 22 14:30 _SUCCESS
> -rw-r--r-- 1 root root 619 Aug 22 14:30 
> part-1-2b17ec6a-ef7e-4b45-927e-f93b88ff4f65-c000
> {code}
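
As a quick sanity check (a sketch; the path is the one from the report and is assumed to be readable locally), the directory can be read back to confirm the data is valid Parquet even without the .snappy.parquet suffix, and the active codec can be inspected:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-suffix-check").master("local[1]").getOrCreate()

// Parquet footers, not file extensions, determine whether the files are readable.
val df = spark.read.parquet("file:///opt/trash4")
df.show()

// The writer's codec is governed by this conf (default snappy); the reported
// difference is only in the file name, not necessarily in the codec used.
println(spark.conf.get("spark.sql.parquet.compression.codec"))
{code}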



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until killed manually. Here's the 
> log you can see, there's no any log after spill the shuffle files to disk.
>  !95330.png! 
> And here is the thread dump, we could see that it always calls native method 
> {{size0}}.
>  !91ADA.png! 
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, here is the screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, and the system usage is pretty high, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it always calls native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until killed manually. Here's the 
> log you can see, there's no any log after spill the shuffle files to disk.
>  !95330.png! 
> And here is the thread dump, we could see that it always calls native method 
> {{size0}}.
>  !91ADA.png! 
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, and the system usage is pretty high, here is the 
> screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until killed manually. Here's the 
> log you can see, there's no any log after spill the shuffle files to disk.
>  !95330.png! 
> And here is the thread dump, we could see that it is calling native method 
> {{size0}}.
>  !91ADA.png! 
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, here is the screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Attachment: D18F4.png
95330.png
91ADA.png

> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until killed manually. Here's the 
> log you can see, there's no any log after spill the shuffle files to disk for 
> several hours.
> And here is the thread dump, we could see that it is calling native method 
> {{size0}}.
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, here is the screenshot. 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-28849:

Description: 
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

 !95330.png! 

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

 !91ADA.png! 

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

 !D18F4.png! 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.

  was:
Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
{{transferTo}} occasionally. What we saw is that when merging shuffle temp 
file, the task is hung for several hours until killed manually. Here's the log 
you can see, there's no any log after spill the shuffle files to disk for 
several hours.

And here is the thread dump, we could see that it is calling native method 
{{size0}}.

And we use strace to trace the system, we found that this thread is always 
calling {{fstat}}, here is the screenshot. 

We didn't find the root cause here, I guess it might be related to FS or disk 
issue. Anyway we should figure out a way to fail fast in a such scenario.


> Spark's UnsafeShuffleWriter may run into infinite loop in transferTo 
> occasionally
> -
>
> Key: SPARK-28849
> URL: https://issues.apache.org/jira/browse/SPARK-28849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Saisai Shao
>Priority: Major
> Attachments: 91ADA.png, 95330.png, D18F4.png
>
>
> Spark's {{UnsafeShuffleWriter}} may run into infinite loop when calling 
> {{transferTo}} occasionally. What we saw is that when merging shuffle temp 
> file, the task is hung for several hours until killed manually. Here's the 
> log you can see, there's no any log after spill the shuffle files to disk for 
> several hours.
>  !95330.png! 
> And here is the thread dump, we could see that it is calling native method 
> {{size0}}.
>  !91ADA.png! 
> And we use strace to trace the system, we found that this thread is always 
> calling {{fstat}}, here is the screenshot. 
>  !D18F4.png! 
> We didn't find the root cause here, I guess it might be related to FS or disk 
> issue. Anyway we should figure out a way to fail fast in a such scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28849) Spark's UnsafeShuffleWriter may run into infinite loop in transferTo occasionally

2019-08-22 Thread Saisai Shao (Jira)
Saisai Shao created SPARK-28849:
---

 Summary: Spark's UnsafeShuffleWriter may run into infinite loop in 
transferTo occasionally
 Key: SPARK-28849
 URL: https://issues.apache.org/jira/browse/SPARK-28849
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Saisai Shao


Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop when 
calling {{transferTo}}. What we saw is that when merging the shuffle temp 
files, the task hangs for several hours until it is killed manually. In the log 
you can see that there is no log output after spilling the shuffle files to 
disk for several hours.

And here is the thread dump; we can see that it is calling the native method 
{{size0}}.

We also used strace to trace the system and found that this thread is constantly 
calling {{fstat}}; here is the screenshot.

We didn't find the root cause here; I guess it might be related to an FS or disk 
issue. Anyway, we should figure out a way to fail fast in such a scenario.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org