[jira] [Resolved] (SPARK-34046) Use join hint in test cases for Join

2021-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34046.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31087
[https://github.com/apache/spark/pull/31087]

> Use join hint in test cases for Join
> 
>
> Key: SPARK-34046
> URL: https://issues.apache.org/jira/browse/SPARK-34046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 3.2.0
>
>
> There are some existing test cases that construct various joins by tuning 
> SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, 
> PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. 
> This can be tricky. Constructing a specific join by using a join hint can be 
> simpler and more robust.
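
For illustration, the difference looks roughly like the following sketch (PySpark; `large_df`, `small_df`, and the join key `id` are hypothetical, and the hint names are the standard Spark 3.x join strategy hints):
{code:python}
# Config-driven: whether this becomes a broadcast join depends on
# spark.sql.autoBroadcastJoinThreshold and the estimated size of small_df.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  # e.g. disable broadcast
config_join = large_df.join(small_df, "id")

# Hint-driven: request the join strategy explicitly, independent of those configs.
broadcast_join  = large_df.join(small_df.hint("broadcast"), "id")     # BroadcastHashJoin
sort_merge_join = large_df.join(small_df.hint("merge"), "id")         # SortMergeJoin
shuffle_hash    = large_df.join(small_df.hint("shuffle_hash"), "id")  # ShuffledHashJoin
{code}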



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261042#comment-17261042
 ] 

Dongjoon Hyun commented on SPARK-34039:
---

Got it. Thanks for the explanation, [~csun].

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate the table cache in the {{ReplaceTable}} v2 command.
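
For context, the scenario the change covers looks roughly like this (a sketch; the catalog and table names are hypothetical, and per the discussion in this thread the point is cache bookkeeping rather than a correctness bug):
{code:python}
# Hypothetical v2 catalog "testcat" and table "ns.tbl".
spark.sql("CACHE TABLE testcat.ns.tbl")

# REPLACE TABLE drops and recreates the table definition; the change tracked here
# is that the command also invalidates the stale cache entry for the old table,
# as SPARK-33492 already did for REPLACE TABLE ... AS SELECT.
spark.sql("REPLACE TABLE testcat.ns.tbl (id BIGINT, data STRING) USING foo")
{code}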



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34039:
--
Priority: Minor  (was: Major)

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.2.0
>
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate the table cache in the {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261036#comment-17261036
 ] 

Chao Sun commented on SPARK-34039:
--

I'm not sure if this is a bug, since:
1. The command is only available in v2, so there is no inconsistency like we saw 
in some other JIRAs.
2. There seems to be no correctness issue: even without the fix, the table 
cache can't be queried because the table itself is changed.



> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate the table cache in the {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34046) Use join hint in test cases for Join

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261022#comment-17261022
 ] 

Apache Spark commented on SPARK-34046:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31087

> Use join hint in test cases for Join
> 
>
> Key: SPARK-34046
> URL: https://issues.apache.org/jira/browse/SPARK-34046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> There are some existing test cases that construct various joins by tuning 
> SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, 
> PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. 
> This can be tricky. Constructing a specific join by using a join hint can be 
> simpler and more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34046) Use join hint in test cases for Join

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34046:


Assignee: Apache Spark  (was: Gengliang Wang)

> Use join hint in test cases for Join
> 
>
> Key: SPARK-34046
> URL: https://issues.apache.org/jira/browse/SPARK-34046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> There are some existing test cases that construct various joins by tuning 
> SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, 
> PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. 
> This can be tricky. Constructing a specific join by using a join hint can be 
> simpler and more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34046) Use join hint in test cases for Join

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261021#comment-17261021
 ] 

Apache Spark commented on SPARK-34046:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31087

> Use join hint in test cases for Join
> 
>
> Key: SPARK-34046
> URL: https://issues.apache.org/jira/browse/SPARK-34046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> There are some existing test cases that construct various joins by tuning 
> SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, 
> PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. 
> This can be tricky. Constructing a specific join by using a join hint can be 
> simpler and more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34046) Use join hint in test cases for Join

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34046:


Assignee: Gengliang Wang  (was: Apache Spark)

> Use join hint in test cases for Join
> 
>
> Key: SPARK-34046
> URL: https://issues.apache.org/jira/browse/SPARK-34046
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
>
> There are some existing test cases that construct various joins by tuning 
> SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, 
> PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. 
> This can be tricky. Constructing a specific join by using a join hint can be 
> simpler and more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34046) Use join hint in test cases for Join

2021-01-07 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-34046:
--

 Summary: Use join hint in test cases for Join
 Key: SPARK-34046
 URL: https://issues.apache.org/jira/browse/SPARK-34046
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.0.0, 3.1.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang


There are some existing test cases that construct various joins by tuning 
SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, 
PREFER_SORTMERGEJOIN, SHUFFLE_PARTITIONS, etc. 
This can be tricky. Constructing a specific join by using a join hint can be 
simpler and more robust.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34005) Update peak memory metrics for each Executor on task end.

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34005.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31029
[https://github.com/apache/spark/pull/31029]

> Update peak memory metrics for each Executor on task end.
> -
>
> Key: SPARK-34005
> URL: https://issues.apache.org/jira/browse/SPARK-34005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> Like other peak memory metrics (e.g., per stage, or per executor within a stage), 
> it's better to update the peak memory metrics for each Executor.
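
For reference, these per-executor peaks surface in the executor list endpoint of the REST API; a minimal sketch, assuming a History Server on localhost:18080 and a placeholder application id:
{code:python}
import requests

# peakMemoryMetrics is part of each executor summary in Spark 3.x.
url = "http://localhost:18080/api/v1/applications/app-20210107120000-0001/executors"
for executor in requests.get(url).json():
    print(executor["id"], executor.get("peakMemoryMetrics"))
{code}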



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34005) Update peak memory metrics for each Executor on task end.

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34005:
--
Affects Version/s: (was: 3.1.0)

> Update peak memory metrics for each Executor on task end.
> -
>
> Key: SPARK-34005
> URL: https://issues.apache.org/jira/browse/SPARK-34005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Like other peak memory metrics (e.g., per stage, or per executor within a stage), 
> it's better to update the peak memory metrics for each Executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260856#comment-17260856
 ] 

Ron Hu edited comment on SPARK-26399 at 1/8/21, 5:18 AM:
-

The initial description of this Jira has this statement: "filtering for task 
status, and returning tasks that match (for example, FAILED tasks)"

To achieve the above, we need a new endpoint like this: 
/applications/[app-id]/stages?taskstatus=[FAILED|KILLED|SUCCESS]

If a user specifies /applications/[app-id]/stages?taskstatus=KILLED, then we 
generate a JSON file containing all the killed-task information from all the 
stages.  This helps users monitor all the killed tasks.  For example, 
when a Spark user enables speculation, he needs the information on all the 
killed tasks so that he can monitor the benefit/cost brought by speculation.

I attach a sample JSON file, [^lispark230_restapi_ex2_stages_failedTasks.json], 
which contains the failed tasks and the corresponding stages, for reference.


was (Author: ron8hu):
The initial description of this jira has this statement:  "filtering for task 
status, and returning tasks that match (for example, FAILED tasks)"

To achieve the above, we need a new endpoint like this: 
/applications/[app-id]/stages?taskstatus=[FAILED|KILLED|SUCCESS]

If a user specifies /applications/[app-id]/stages?taskstatus=KILLED, then we 
generate a json file to contain all the killed task information from all the 
stages.  This way can help users monitor all the killed tasks.  For example, a 
Spark user enables speculation, he needs the information of all the killed 
tasks so that he can monitor the benefit/cost brought by speculation.

I attach a sample json file  [^lispark230_restapi_ex2_stages_failedTasks.json]  
which contains the failed tasks and the corresponding stages for reference.

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_failedTasks.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http://<history server>:18080/api/v1/applications/<...>/executorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, and adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified
> Note that the above description is too brief to be clear.  Ron Hu added 
> additional details to explain the use cases from the downstream products.  
> See the comments dated 1/07/2021 with a couple of sample JSON files.
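
A minimal sketch of how the endpoint proposed in Ron Hu's comment could be called once implemented (host, port, and application id are placeholders):
{code:python}
import requests

base = "http://localhost:18080/api/v1"
app_id = "app-20210107120000-0001"  # placeholder

# Proposed form: /applications/[app-id]/stages?taskstatus=[FAILED|KILLED|SUCCESS]
resp = requests.get(f"{base}/applications/{app_id}/stages",
                    params={"taskstatus": "FAILED"})
for stage in resp.json():
    print(stage.get("stageId"), stage.get("status"))
{code}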



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33938) Optimize Like Any/All by LikeSimplification

2021-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33938:
---

Assignee: jiaan.geng

> Optimize Like Any/All by LikeSimplification 
> 
>
> Key: SPARK-33938
> URL: https://issues.apache.org/jira/browse/SPARK-33938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
>
> We should optimize Like Any/All by LikeSimplification to improve performance.
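
As an illustration (a sketch with a hypothetical table `t` and column `col`), the kind of predicates the rule could rewrite into cheaper StartsWith/EndsWith/Contains checks:
{code:python}
# LIKE ANY / LIKE ALL with simple '%'-only patterns; with the optimization these
# can be simplified instead of being evaluated as generic LIKE patterns per row.
spark.sql("SELECT * FROM t WHERE col LIKE ANY ('abc%', '%xyz')").explain()
spark.sql("SELECT * FROM t WHERE col LIKE ALL ('abc%', '%xyz%')").explain()
{code}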



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33938) Optimize Like Any/All by LikeSimplification

2021-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33938.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31063
[https://github.com/apache/spark/pull/31063]

> Optimize Like Any/All by LikeSimplification 
> 
>
> Key: SPARK-33938
> URL: https://issues.apache.org/jira/browse/SPARK-33938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.1.1
>
>
> We should optimize Like Any/All by LikeSimplification to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17261007#comment-17261007
 ] 

Dongjoon Hyun commented on SPARK-34039:
---

[~csun]. Is this an `Improvement` instead of `Bug`?

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate the table cache in the {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34039:
-

Assignee: Chao Sun

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate the table cache in the {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34039.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31081
[https://github.com/apache/spark/pull/31081]

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate the table cache in the {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34045) OneVsRestModel.transform should not call setter of submodels

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260969#comment-17260969
 ] 

Apache Spark commented on SPARK-34045:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31086

> OneVsRestModel.transform should not call setter of submodels
> 
>
> Key: SPARK-34045
> URL: https://issues.apache.org/jira/browse/SPARK-34045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The featuresCol of the submodels may be changed in transform:
> {code:java}
>  scala> val df = 
> spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
> determining the number of features by going though the input. If you know the 
> number in advance, please specify it via 'numFeatures' option to avoid the 
> extra scan.
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val lr = new 
> LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_3003cb3321a1
> scala> val ovr = new OneVsRest().setClassifier(lr)
> ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf
> scala> val ovrm = ovr.fit(df)
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeRefBLAS
> ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
> uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
> numFeatures=4
> scala> val df2 = df.withColumnRenamed("features", "features2")
> df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]
> scala> ovrm.setFeaturesCol("features2")
> res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
> classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4
> scala> ovrm.models.map(_.getFeaturesCol)
> res1: Array[String] = Array(features, features, features)
> scala> ovrm.transform(df2)
> res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 
> 2 more fields]
> scala> ovrm.models.map(_.getFeaturesCol)
> res3: Array[String] = Array(features2, features2, features2)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34045) OneVsRestModel.transform should not call setter of submodels

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34045:


Assignee: Apache Spark

> OneVsRestModel.transform should not call setter of submodels
> 
>
> Key: SPARK-34045
> URL: https://issues.apache.org/jira/browse/SPARK-34045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> The featuresCol of the submodels may be changed in transform:
> {code:java}
>  scala> val df = 
> spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
> determining the number of features by going though the input. If you know the 
> number in advance, please specify it via 'numFeatures' option to avoid the 
> extra scan.
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val lr = new 
> LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_3003cb3321a1
> scala> val ovr = new OneVsRest().setClassifier(lr)
> ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf
> scala> val ovrm = ovr.fit(df)
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeRefBLAS
> ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
> uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
> numFeatures=4
> scala> val df2 = df.withColumnRenamed("features", "features2")
> df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]
> scala> ovrm.setFeaturesCol("features2")
> res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
> classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4
> scala> ovrm.models.map(_.getFeaturesCol)
> res1: Array[String] = Array(features, features, features)
> scala> ovrm.transform(df2)
> res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 
> 2 more fields]
> scala> ovrm.models.map(_.getFeaturesCol)
> res3: Array[String] = Array(features2, features2, features2)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34045) OneVsRestModel.transform should not call setter of submodels

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260968#comment-17260968
 ] 

Apache Spark commented on SPARK-34045:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31086

> OneVsRestModel.transform should not call setter of submodels
> 
>
> Key: SPARK-34045
> URL: https://issues.apache.org/jira/browse/SPARK-34045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The featuresCol of the submodels may be changed in transform:
> {code:java}
>  scala> val df = 
> spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
> determining the number of features by going though the input. If you know the 
> number in advance, please specify it via 'numFeatures' option to avoid the 
> extra scan.
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val lr = new 
> LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_3003cb3321a1
> scala> val ovr = new OneVsRest().setClassifier(lr)
> ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf
> scala> val ovrm = ovr.fit(df)
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeRefBLAS
> ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
> uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
> numFeatures=4
> scala> val df2 = df.withColumnRenamed("features", "features2")
> df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]
> scala> ovrm.setFeaturesCol("features2")
> res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
> classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4
> scala> ovrm.models.map(_.getFeaturesCol)
> res1: Array[String] = Array(features, features, features)
> scala> ovrm.transform(df2)
> res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 
> 2 more fields]
> scala> ovrm.models.map(_.getFeaturesCol)
> res3: Array[String] = Array(features2, features2, features2)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34045) OneVsRestModel.transform should not call setter of submodels

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34045:


Assignee: (was: Apache Spark)

> OneVsRestModel.transform should not call setter of submodels
> 
>
> Key: SPARK-34045
> URL: https://issues.apache.org/jira/browse/SPARK-34045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The featuresCol of the submodels may be changed in transform:
> {code:java}
>  scala> val df = 
> spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
> determining the number of features by going though the input. If you know the 
> number in advance, please specify it via 'numFeatures' option to avoid the 
> extra scan.
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val lr = new 
> LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_3003cb3321a1
> scala> val ovr = new OneVsRest().setClassifier(lr)
> ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf
> scala> val ovrm = ovr.fit(df)
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeSystemBLAS
> 21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
> com.github.fommil.netlib.NativeRefBLAS
> ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
> uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
> numFeatures=4
> scala> val df2 = df.withColumnRenamed("features", "features2")
> df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]
> scala> ovrm.setFeaturesCol("features2")
> res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
> classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4
> scala> ovrm.models.map(_.getFeaturesCol)
> res1: Array[String] = Array(features, features, features)
> scala> ovrm.transform(df2)
> res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 
> 2 more fields]
> scala> ovrm.models.map(_.getFeaturesCol)
> res3: Array[String] = Array(features2, features2, features2)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34045) OneVsRestModel.transform should not call setter of submodels

2021-01-07 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-34045:
-
Description: 
The featuresCol of the submodels may be changed in transform:
{code:java}
 scala> val df = 
spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
determining the number of features by going though the input. If you know the 
number in advance, please specify it via 'numFeatures' option to avoid the 
extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val lr = new 
LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_3003cb3321a1

scala> val ovr = new OneVsRest().setClassifier(lr)
ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf

scala> val ovrm = ovr.fit(df)
21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
numFeatures=4

scala> val df2 = df.withColumnRenamed("features", "features2")
df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]

scala> ovrm.setFeaturesCol("features2")
res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4


scala> ovrm.models.map(_.getFeaturesCol)
res1: Array[String] = Array(features, features, features)

scala> ovrm.transform(df2)
res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 2 
more fields]

scala> ovrm.models.map(_.getFeaturesCol)
res3: Array[String] = Array(features2, features2, features2)
{code}

  was:
The featuresCol of the submodels may be changed in transform:
{code:java}
 scala> val df = 
spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
determining the number of features by going though the input. If you know the 
number in advance, please specify it via 'numFeatures' option to avoid the 
extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val lr = new 
LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_3003cb3321a1

scala> val ovr = new OneVsRest().setClassifier(lr)
ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf

scala> val ovrm = ovr.fit(df)
21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
numFeatures=4

scala> val df2 = df.withColumnRenamed("features", "features2")
df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]scala> 
ovrm.setFeaturesCol("features2")
res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4


scala> ovrm.models.map(_.getFeaturesCol)
res1: Array[String] = Array(features, features, features)
scala> ovrm.transform(df2)
res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 2 
more fields]
scala> ovrm.models.map(_.getFeaturesCol)
res3: Array[String] = Array(features2, features2, features2)
{code}


> OneVsRestModel.transform should not call setter of submodels
> 
>
> Key: SPARK-34045
> URL: https://issues.apache.org/jira/browse/SPARK-34045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The featuresCol of the submodels may be changed in transform:
> {code:java}
>  scala> val df = 
> spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
> determining the number of features by going though the input. If you know the 
> number in advance, please specify it via 'numFeatures' option to avoid the 
> extra scan.
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val lr = new 
> LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> lr: 

[jira] [Created] (SPARK-34045) OneVsRestModel.transform should not call setter of submodels

2021-01-07 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-34045:


 Summary: OneVsRestModel.transform should not call setter of 
submodels
 Key: SPARK-34045
 URL: https://issues.apache.org/jira/browse/SPARK-34045
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng


The featuresCol of the submodels may be changed in transform:
{code:java}
 scala> val df = 
spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
21/01/08 09:52:01 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
determining the number of features by going though the input. If you know the 
number in advance, please specify it via 'numFeatures' option to avoid the 
extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val lr = new 
LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_3003cb3321a1

scala> val ovr = new OneVsRest().setClassifier(lr)
ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_b2ec3ec45dbf

scala> val ovrm = ovr.fit(df)
21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
21/01/08 09:52:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
uid=oneVsRest_b2ec3ec45dbf, classifier=logreg_3003cb3321a1, numClasses=3, 
numFeatures=4

scala> val df2 = df.withColumnRenamed("features", "features2")
df2: org.apache.spark.sql.DataFrame = [label: double, features2: vector]scala> 
ovrm.setFeaturesCol("features2")
res0: ovrm.type = OneVsRestModel: uid=oneVsRest_b2ec3ec45dbf, 
classifier=logreg_3003cb3321a1, numClasses=3, numFeatures=4


scala> ovrm.models.map(_.getFeaturesCol)
res1: Array[String] = Array(features, features, features)
scala> ovrm.transform(df2)
res2: org.apache.spark.sql.DataFrame = [label: double, features2: vector ... 2 
more fields]
scala> ovrm.models.map(_.getFeaturesCol)
res3: Array[String] = Array(features2, features2, features2)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33818) Doc `spark.sql.parser.quotedRegexColumnNames`

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33818:
-

Assignee: angerszhu

> Doc `spark.sql.parser.quotedRegexColumnNames`
> -
>
> Key: SPARK-33818
> URL: https://issues.apache.org/jira/browse/SPARK-33818
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Add the usage of {{spark.sql.parser.quotedRegexColumnNames}} to the documentation, 
> since it's useful.
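
A minimal sketch of the behavior to be documented (the table `people` and its columns `id` and `name` are hypothetical):
{code:python}
# When spark.sql.parser.quotedRegexColumnNames is true, a backquoted identifier
# in the SELECT list is interpreted as a regular expression over column names.
spark.conf.set("spark.sql.parser.quotedRegexColumnNames", "true")
spark.sql("SELECT `(id|name)` FROM people").show()  # expands to the id and name columns
{code}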



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33818) Doc `spark.sql.parser.quotedRegexColumnNames`

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33818.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30816
[https://github.com/apache/spark/pull/30816]

> Doc `spark.sql.parser.quotedRegexColumnNames`
> -
>
> Key: SPARK-33818
> URL: https://issues.apache.org/jira/browse/SPARK-33818
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> Add the usage of {{spark.sql.parser.quotedRegexColumnNames}} to the documentation, 
> since it's useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33818) Doc `spark.sql.parser.quotedRegexColumnNames`

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33818:
--
Fix Version/s: (was: 3.1.0)
   3.1.1

> Doc `spark.sql.parser.quotedRegexColumnNames`
> -
>
> Key: SPARK-33818
> URL: https://issues.apache.org/jira/browse/SPARK-33818
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.1
>
>
> Add the usage of {{spark.sql.parser.quotedRegexColumnNames}} to the documentation, 
> since it's useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33100) Support parse the sql statements with c-style comments

2021-01-07 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33100:
-
Fix Version/s: 3.0.2

> Support parse the sql statements with c-style comments
> --
>
> Key: SPARK-33100
> URL: https://issues.apache.org/jira/browse/SPARK-33100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: feiwang
>Assignee: feiwang
>Priority: Minor
> Fix For: 3.0.2, 3.1.0, 3.2.0
>
>
> Currently, spark-sql does not support parsing SQL statements that contain C-style 
> comments.
> For example, the SQL statements:
> {code:java}
> /* SELECT 'test'; */
> SELECT 'test';
> {code}
> would be split into two statements:
> The first: "/* SELECT 'test'"
> The second: "*/ SELECT 'test'"
> Then it would throw an exception because the first one is illegal.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, TimestampType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])

df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    if  datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
        tmp= "Morning: " + str(i)
    elif  datetime.time(12, 0)<=i.time() < datetime.time(17, 0):
        tmp= "Afternoon: " + str(i)
    elif  datetime.time(17, 0)<=i.time() < datetime.time(21, 0):
        tmp= "Evening"
    elif  datetime.time(21, 0)<=i.time() < datetime.time(0, 0):
        tmp= "Night"
    elif  datetime.time(0, 0)<=i.time() < datetime.time(5, 0):
        tmp= "Night"
    return tmp

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)


{code}
 

Below is the output of the above code:
{code:java}
+----+-----+-------------------+----------------------------+
|user|id   |ts                 |day_part                    |
+----+-----+-------------------+----------------------------+
|usr1|17.0 |2018-02-10 15:27:18|Morning: 2018-02-10 09:27:18|
|usr1|13.0 |2018-02-11 12:27:18|Morning: 2018-02-11 06:27:18|
|usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 05:27:18|
|usr1|20.0 |2018-02-13 15:27:18|Morning: 2018-02-13 09:27:18|
|usr1|17.0 |2018-02-14 12:27:18|Morning: 2018-02-14 06:27:18|
|usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 05:27:18|
|usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 05:27:18|
+----+-----+-------------------+----------------------------+
{code}
The above output is incorrect: the ts and day_part columns don't have the same 
timestamps. Below is the output I would expect:

 
{code:java}
+----+-----+-------------------+------------------------------+
|user|id   |ts                 |day_part                      |
+----+-----+-------------------+------------------------------+
|usr1|17.0 |2018-02-10 15:27:18|Afternoon: 2018-02-10 15:27:18|
|usr1|13.0 |2018-02-11 12:27:18|Afternoon: 2018-02-11 12:27:18|
|usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 11:27:18  |
|usr1|20.0 |2018-02-13 15:27:18|Afternoon: 2018-02-13 15:27:18|
|usr1|17.0 |2018-02-14 12:27:18|Afternoon: 2018-02-14 12:27:18|
|usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 11:27:18  |
|usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 11:27:18  |
+----+-----+-------------------+------------------------------+{code}
 

  was:
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, TimestampType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),

[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, TimestampType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])

df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    if  datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
        tmp= "Morning: " + str(i)
    elif  datetime.time(12, 0)<=i.time() < datetime.time(17, 0):
        tmp= str(i)
    elif  datetime.time(17, 0)<=i.time() < datetime.time(21, 0):
        tmp= "Evening"
    elif  datetime.time(21, 0)<=i.time() < datetime.time(0, 0):
        tmp= "Night"
    elif  datetime.time(0, 0)<=i.time() < datetime.time(5, 0):
        tmp= "Night"
    return tmp

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)


{code}
 

Below is the output of the above code:
{code:java}
+----+-----+-------------------+----------------------------+
|user|id   |ts                 |day_part                    |
+----+-----+-------------------+----------------------------+
|usr1|17.0 |2018-02-10 15:27:18|Morning: 2018-02-10 09:27:18|
|usr1|13.0 |2018-02-11 12:27:18|Morning: 2018-02-11 06:27:18|
|usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 05:27:18|
|usr1|20.0 |2018-02-13 15:27:18|Morning: 2018-02-13 09:27:18|
|usr1|17.0 |2018-02-14 12:27:18|Morning: 2018-02-14 06:27:18|
|usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 05:27:18|
|usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 05:27:18|
+----+-----+-------------------+----------------------------+
{code}
The above output is incorrect: the ts and day_part columns don't have the same 
timestamps. Below is the output I would expect:

 
{code:java}
+----+-----+-------------------+------------------------------+
|user|id   |ts                 |day_part                      |
+----+-----+-------------------+------------------------------+
|usr1|17.0 |2018-02-10 15:27:18|Afternoon: 2018-02-10 15:27:18|
|usr1|13.0 |2018-02-11 12:27:18|Afternoon: 2018-02-11 12:27:18|
|usr1|25.0 |2018-02-12 11:27:18|Morning: 2018-02-12 11:27:18  |
|usr1|20.0 |2018-02-13 15:27:18|Afternoon: 2018-02-13 15:27:18|
|usr1|17.0 |2018-02-14 12:27:18|Afternoon: 2018-02-14 12:27:18|
|usr2|99.0 |2018-02-15 11:27:18|Morning: 2018-02-15 11:27:18  |
|usr2|156.0|2018-02-22 11:27:18|Morning: 2018-02-22 11:27:18  |
+----+-----+-------------------+------------------------------+{code}
 

  was:
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, TimestampType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),

[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, TimestampType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])

df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    if  datetime.time(5, 0)<=i.time() < datetime.time(12, 0):
        tmp= "Morning: " + str(i)
    elif  datetime.time(12, 0)<=i.time() < datetime.time(17, 0):
        tmp= str(i)
    elif  datetime.time(17, 0)<=i.time() < datetime.time(21, 0):
        tmp= "Evening"
    elif  datetime.time(21, 0)<=i.time() < datetime.time(0, 0):
        tmp= "Night"
    elif  datetime.time(0, 0)<=i.time() < datetime.time(5, 0):
        tmp= "Night"
    return tmp

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)


{code}
 

Below is the output of the above code:
{code:java}
+----+-----+-------------------+-------------------+
|user|id   |ts                 |day_part           |
+----+-----+-------------------+-------------------+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 09:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 06:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 05:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 09:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 06:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 05:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 05:27:18|
+----+-----+-------------------+-------------------+
{code}
The above output is incorrect: the ts and day_part columns don't have the same 
timestamps. Below is the output I would expect:

 
{code:java}
+----+-----+-------------------+-------------------+
|user|id   |ts                 |day_part           |
+----+-----+-------------------+-------------------+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 15:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 12:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 11:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 15:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 12:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 11:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 11:27:18|
+----+-----+-------------------+-------------------+{code}
If I change the return type to TimestampType, then 'day_part' will have the 
correct timestamp.

  was:
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
  

[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using a udf, the pyspark udf wrongly converts the timestamps to UTC time. 
The ts (timestamp) column is already in UTC time; therefore, the pyspark udf 
should not convert the ts (timestamp) column to a UTC timestamp.

I have used the following configs to let Spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))

df.show(truncate=False)

def some_time_udf(i):
return str(i)

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)

{code}
 

Below is the output of the above code:
{code:java}
++-+---+---+
|user|id   |ts |day_part   |
++-+---+---+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 09:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 06:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 05:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 09:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 06:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 05:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 05:27:18|
++-+---+---+
{code}
The above output is incorrect. You can see that the ts and day_part columns don't have 
the same timestamps. Below is the output I would expect:

 
{code:java}
++-+---+---+
|user|id   |ts |day_part   |
++-+---+---+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 15:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 12:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 11:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 15:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 12:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 11:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 11:27:18|
++-+---+---+{code}
If I change the return type to TimestampType, then 'day_part' will have the correct 
timestamp.

  was:
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
(timestamp) column is already in UTC time. Therefore, pyspark udf should not 
convert ts (timestamp) column into UTC timestamp.

I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))

df.show(truncate=False)

def some_time_udf(i):
return str(i)

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)

{code}
 

Below is the output of 

[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
(timestamp) column is already in UTC time. Therefore, pyspark udf should not 
convert ts (timestamp) column into UTC timestamp.

I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime

spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))

df.show(truncate=False)

def some_time_udf(i):
return str(i)

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)

{code}
 

Below is the output of the above code:
{code:java}
++-+---+---+
|user|id   |ts |day_part   |
++-+---+---+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 09:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 06:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 05:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 09:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 06:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 05:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 05:27:18|
++-+---+---+
{code}
The above output is incorrect. You can see that the ts and day_part columns don't have 
the same timestamps. Below is the output I would expect:
{code:java}
++-+---+---+
|user|id   |ts |day_part   |
++-+---+---+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 15:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 12:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 11:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 15:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 12:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 11:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 11:27:18|
++-+---+---+
{code}

  was:
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
(timestamp) column is already in UTC time. Therefore, pyspark udf should not 
convert ts (timestamp) column into UTC timestamp.

I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    return str(i)

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)

{code}
 

Below is the output of the above code:
{code:java}
++-+---+---+
|user|id   |ts 

[jira] [Commented] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260911#comment-17260911
 ] 

Nasir Ali commented on SPARK-33863:
---

[~hyukjin.kwon] and [~viirya] I have simplified example code, added/revised 
details, and also added output and expected output. Please let me know if you 
need more information.

> Pyspark UDF wrongly changes timestamps to UTC
> -
>
> Key: SPARK-33863
> URL: https://issues.apache.org/jira/browse/SPARK-33863
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: MAC/Linux
> Standalone cluster / local machine
>Reporter: Nasir Ali
>Priority: Major
>
> *Problem*:
> I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
> column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
> (timestamp) column is already in UTC time. Therefore, pyspark udf should not 
> convert ts (timestamp) column into UTC timestamp.
> I have used following configs to let spark know the timestamps are in UTC:
>  
> {code:java}
> --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
> --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
> --conf spark.sql.session.timeZone=UTC
> {code}
> Below is a code snippet to reproduce the error:
>  
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
> from pyspark.sql.types import StringType
> import datetime
> spark = SparkSession.builder.config("spark.sql.session.timeZone", 
> "UTC").getOrCreate()df = spark.createDataFrame([("usr1",17.00, 
> "2018-02-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-02-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-02-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-02-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-02-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-02-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-02-22T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.show(truncate=False)def some_time_udf(i):
> return str(i)udf = F.udf(some_time_udf,StringType())
> df.withColumn("day_part", udf(df.ts)).show(truncate=False)
> {code}
>  
> Below is the output of the above code:
> {code:java}
> ++-+---+---+
> |user|id   |ts |day_part   |
> ++-+---+---+
> |usr1|17.0 |2018-02-10 15:27:18|2018-02-10 09:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|2018-02-11 06:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|2018-02-12 05:27:18|
> |usr1|20.0 |2018-02-13 15:27:18|2018-02-13 09:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|2018-02-14 06:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|2018-02-15 05:27:18|
> |usr2|156.0|2018-02-22 11:27:18|2018-02-22 05:27:18|
> ++-+---+---+
> {code}
> Above output is incorrect. You can see ts and day_part columns don't have 
> same timestamps. Below is the output I would expect:
> {code:java}
> ++-+---+---+
> |user|id   |ts |day_part   |
> ++-+---+---+
> |usr1|17.0 |2018-02-10 15:27:18|2018-02-10 15:27:18|
> |usr1|13.0 |2018-02-11 12:27:18|2018-02-11 12:27:18|
> |usr1|25.0 |2018-02-12 11:27:18|2018-02-12 11:27:18|
> |usr1|20.0 |2018-02-13 15:27:18|2018-02-13 15:27:18|
> |usr1|17.0 |2018-02-14 12:27:18|2018-02-14 12:27:18|
> |usr2|99.0 |2018-02-15 11:27:18|2018-02-15 11:27:18|
> |usr2|156.0|2018-02-22 11:27:18|2018-02-22 11:27:18|
> ++-+---+---+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33863) Pyspark UDF wrongly changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

I have a dataframe with a ts (timestamp) column in UTC. If I create a new 
column using udf, pyspark udf wrongly changes timestamps into UTC time. ts 
(timestamp) column is already in UTC time. Therefore, pyspark udf should not 
convert ts (timestamp) column into UTC timestamp.

I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-02-10T15:27:18+00:00"),
("usr1",13.00, "2018-02-11T12:27:18+00:00"),
("usr1",25.00, "2018-02-12T11:27:18+00:00"),
("usr1",20.00, "2018-02-13T15:27:18+00:00"),
("usr1",17.00, "2018-02-14T12:27:18+00:00"),
("usr2",99.00, "2018-02-15T11:27:18+00:00"),
("usr2",156.00, "2018-02-22T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    return str(i)

udf = F.udf(some_time_udf,StringType())
df.withColumn("day_part", udf(df.ts)).show(truncate=False)

{code}
 

Below is the output of the above code:
{code:java}
++-+---+---+
|user|id   |ts |day_part   |
++-+---+---+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 09:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 06:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 05:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 09:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 06:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 05:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 05:27:18|
++-+---+---+
{code}
The above output is incorrect. You can see that the ts and day_part columns don't have 
the same timestamps. Below is the output I would expect:


{code:java}
++-+---+---+
|user|id   |ts |day_part   |
++-+---+---+
|usr1|17.0 |2018-02-10 15:27:18|2018-02-10 15:27:18|
|usr1|13.0 |2018-02-11 12:27:18|2018-02-11 12:27:18|
|usr1|25.0 |2018-02-12 11:27:18|2018-02-12 11:27:18|
|usr1|20.0 |2018-02-13 15:27:18|2018-02-13 15:27:18|
|usr1|17.0 |2018-02-14 12:27:18|2018-02-14 12:27:18|
|usr2|99.0 |2018-02-15 11:27:18|2018-02-15 11:27:18|
|usr2|156.0|2018-02-22 11:27:18|2018-02-22 11:27:18|
++-+---+---+
{code}

  was:
*Problem*:

If I create a new column using udf, pyspark udf changes timestamps into UTC 
time. I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
("usr1",13.00, "2018-03-11T12:27:18+00:00"),
("usr1",25.00, "2018-03-12T11:27:18+00:00"),
("usr1",20.00, "2018-03-13T15:27:18+00:00"),
("usr1",17.00, "2018-03-14T12:27:18+00:00"),
("usr2",99.00, "2018-03-15T11:27:18+00:00"),
("usr2",156.00, "2018-03-22T11:27:18+00:00"),
("usr2",17.00, "2018-03-31T11:27:18+00:00"),
("usr2",25.00, "2018-03-15T11:27:18+00:00"),
("usr2",25.00, "2018-03-16T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    tmp = ""
    if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
        tmp = "Morning -> " + str(i)
    return tmp

udf = F.udf(some_time_udf,StringType())

df.withColumn("day_part", udf(df.ts)).show(truncate=False)


{code}
I 

[jira] [Updated] (SPARK-33863) Pyspark UDF changes timestamps to UTC

2021-01-07 Thread Nasir Ali (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nasir Ali updated SPARK-33863:
--
Description: 
*Problem*:

If I create a new column using udf, pyspark udf changes timestamps into UTC 
time. I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
("usr1",13.00, "2018-03-11T12:27:18+00:00"),
("usr1",25.00, "2018-03-12T11:27:18+00:00"),
("usr1",20.00, "2018-03-13T15:27:18+00:00"),
("usr1",17.00, "2018-03-14T12:27:18+00:00"),
("usr2",99.00, "2018-03-15T11:27:18+00:00"),
("usr2",156.00, "2018-03-22T11:27:18+00:00"),
("usr2",17.00, "2018-03-31T11:27:18+00:00"),
("usr2",25.00, "2018-03-15T11:27:18+00:00"),
("usr2",25.00, "2018-03-16T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))
df.show(truncate=False)

def some_time_udf(i):
    tmp = ""
    if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
        tmp = "Morning -> " + str(i)
    return tmp

udf = F.udf(some_time_udf,StringType())

df.withColumn("day_part", udf(df.ts)).show(truncate=False)


{code}
I have concatenated the timestamps with the string to show that pyspark passes the 
timestamps as UTC.

  was:
*Problem*:

If I create a new column using udf, pyspark udf changes timestamps into UTC 
time. I have used following configs to let spark know the timestamps are in UTC:

 
{code:java}
--conf spark.driver.extraJavaOptions=-Duser.timezone=UTC 
--conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
--conf spark.sql.session.timeZone=UTC
{code}
Below is a code snippet to reproduce the error:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import datetime
spark = SparkSession.builder.config("spark.sql.session.timeZone", 
"UTC").getOrCreate()

df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
("usr1",13.00, "2018-03-11T12:27:18+00:00"),
("usr1",25.00, "2018-03-12T11:27:18+00:00"),
("usr1",20.00, "2018-03-13T15:27:18+00:00"),
("usr1",17.00, "2018-03-14T12:27:18+00:00"),
("usr2",99.00, "2018-03-15T11:27:18+00:00"),
("usr2",156.00, "2018-03-22T11:27:18+00:00"),
("usr2",17.00, "2018-03-31T11:27:18+00:00"),
("usr2",25.00, "2018-03-15T11:27:18+00:00"),
("usr2",25.00, "2018-03-16T11:27:18+00:00")
],
   ["user","id", "ts"])
df = df.withColumn('ts', df.ts.cast('timestamp'))

df.show(truncate=False)

def some_time_udf(i):
    tmp = ""
    if datetime.time(5, 0) <= i.time() < datetime.time(12, 0):
        tmp = "Morning -> " + str(i)
    elif datetime.time(12, 0) <= i.time() < datetime.time(17, 0):
        tmp = "Afternoon -> " + str(i)
    elif datetime.time(17, 0) <= i.time() < datetime.time(21, 0):
        tmp = "Evening -> " + str(i)
    elif datetime.time(21, 0) <= i.time() < datetime.time(0, 0):
        tmp = "Night -> " + str(i)
    elif datetime.time(0, 0) <= i.time() < datetime.time(5, 0):
        tmp = "Night -> " + str(i)
    return tmp

sometimeudf = F.udf(some_time_udf,StringType())

df.withColumn("day_part", sometimeudf("ts")).show(truncate=False)

{code}
I have concatenated the timestamps with the string to show that pyspark passes the 
timestamps as UTC.


> Pyspark UDF changes timestamps to UTC
> -
>
> Key: SPARK-33863
> URL: https://issues.apache.org/jira/browse/SPARK-33863
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
> Environment: MAC/Linux
> Standalone cluster / local machine
>Reporter: Nasir Ali
>Priority: Major
>
> *Problem*:
> If I create a new column using udf, pyspark udf changes timestamps into UTC 
> time. I have used following configs to let spark know the timestamps are in 
> UTC:
>  
> {code:java}
> --conf 

[jira] [Commented] (SPARK-33106) Fix sbt resolvers clash

2021-01-07 Thread Alexander Bessonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260903#comment-17260903
 ] 

Alexander Bessonov commented on SPARK-33106:


That doesn't seem to fix the issue:
{code:java}build/sbt publishLocal{code} now seems to end with an error 
"Undefined resolver 'ivyLocal'".

> Fix sbt resolvers clash
> ---
>
> Key: SPARK-33106
> URL: https://issues.apache.org/jira/browse/SPARK-33106
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Assignee: Denis Pyshev
>Priority: Minor
> Fix For: 3.1.0
>
>
> During sbt upgrade from 0.13 to 1.x, exact resolvers list was used as is.
> That leads to local resolvers name clashing, which is observed as warning 
> from SBT:
> {code:java}
> [warn] Multiple resolvers having different access mechanism configured with 
> same name 'local'. To avoid conflict, Remove duplicate project resolvers 
> (`resolvers`) or rename publishing resolve
> r (`publishTo`).
> {code}
> This needs to be fixed to avoid potential errors and reduce log noise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34018) NPE in ExecutorPodsSnapshot

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34018:
--
Parent: SPARK-33005
Issue Type: Sub-task  (was: Bug)

> NPE in ExecutorPodsSnapshot
> ---
>
> Key: SPARK-34018
> URL: https://issues.apache.org/jira/browse/SPARK-34018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Blocker
> Fix For: 3.1.1
>
>
> Currently the test (finishedExecutorWithRunningSidecar in test utils depends 
> on nulls matching) passes because we match the null on the container state 
> string. To fix this, label both the statuses and ensure the 
> ExecutorPodSnapshot starts with the default config to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34018) NPE in ExecutorPodsSnapshot

2021-01-07 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34018.
---
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 31071
[https://github.com/apache/spark/pull/31071]

> NPE in ExecutorPodsSnapshot
> ---
>
> Key: SPARK-34018
> URL: https://issues.apache.org/jira/browse/SPARK-34018
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Blocker
> Fix For: 3.1.1
>
>
> Currently the test (finishedExecutorWithRunningSidecar in test utils depends 
> on nulls matching) passes because we match the null on the container state 
> string. To fix this, label both the statuses and ensure the 
> ExecutorPodSnapshot starts with the default config to match.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34044:
-
Fix Version/s: (was: 3.1.0)
   3.1.1

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34044.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31085
[https://github.com/apache/spark/pull/31085]

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34044:


Assignee: Dongjoon Hyun

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34041:


Assignee: Hyukjin Kwon

> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)
> 4. Mention other migration guides as well because PySpark can get affected by 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34041.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 31082
[https://github.com/apache/spark/pull/31082]

> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)
> 4. Mention other migration guides as well because PySpark can get affected by 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34044:


Assignee: (was: Apache Spark)

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34044:


Assignee: Apache Spark

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260866#comment-17260866
 ] 

Apache Spark commented on SPARK-34044:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31085

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260867#comment-17260867
 ] 

Apache Spark commented on SPARK-34044:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31085

> Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
> -
>
> Key: SPARK-34044
> URL: https://issues.apache.org/jira/browse/SPARK-34044
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34044) Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md

2021-01-07 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-34044:
-

 Summary: Add spark.sql.hive.metastore.jars.path to 
sql-data-sources-hive-tables.md
 Key: SPARK-34044
 URL: https://issues.apache.org/jira/browse/SPARK-34044
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.2.0, 3.1.1
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-26399:
---
Description: 
Add the peak values for the metrics to the stages REST API. Also add a new 
executorSummary REST API, which will return executor summary metrics for a 
specified stage:
{code:java}
curl http://:18080/api/v1/applicationsexecutorMetricsSummary{code}
Add parameters to the stages REST API to specify:
 * filtering for task status, and returning tasks that match (for example, 
FAILED tasks).
 * task metric quantiles, and adding the task summary if specified
 * executor metric quantiles, and adding the executor summary if specified

Note that the above description is too brief to be clear.  Ron Hu added 
additional details to explain the use cases from the downstream products; see 
the comments dated 1/07/2021, which include a couple of sample json files.

  was:
Add the peak values for the metrics to the stages REST API. Also add a new 
executorSummary REST API, which will return executor summary metrics for a 
specified stage:
{code:java}
curl http://:18080/api/v1/applicationsexecutorMetricsSummary{code}
Add parameters to the stages REST API to specify:
 * filtering for task status, and returning tasks that match (for example, 
FAILED tasks).
 * task metric quantiles, and adding the task summary if specified
 * executor metric quantiles, and adding the executor summary if specified


> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_failedTasks.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified
> Note that the above description is too brief to be clear.  Ron Hu added the 
> additional details to explain the use cases from the downstream products.  
> See the comments dated 1/07/2021 with a couple of sample json files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260856#comment-17260856
 ] 

Ron Hu commented on SPARK-26399:


The initial description of this jira has this statement:  "filtering for task 
status, and returning tasks that match (for example, FAILED tasks)"

To achieve this, we need a new endpoint like this: 
/applications/[app-id]/stages?taskstatus=[FAILED|KILLED|SUCCESS]

If a user specifies /applications/[app-id]/stages?taskstatus=KILLED, then we 
generate a json file containing all the killed-task information from all the 
stages.  This can help users monitor all the killed tasks.  For example, if a 
Spark user enables speculation, they need the information about all the killed 
tasks so that they can monitor the benefit/cost brought by speculation.

I attach a sample json file  [^lispark230_restapi_ex2_stages_failedTasks.json]  
which contains the failed tasks and the corresponding stages for reference.
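
To make the intent concrete, here is a minimal sketch of how a client could use the 
proposed filter (the taskstatus parameter is only the proposal above, not an existing 
Spark API; the history-server URL, app id, and response fields are placeholders):

{code:java}
import requests

HISTORY_SERVER = "http://localhost:18080"   # placeholder
APP_ID = "app-20210107120000-0001"          # placeholder

# Hypothetical call to the proposed endpoint with the taskstatus filter.
resp = requests.get(
    f"{HISTORY_SERVER}/api/v1/applications/{APP_ID}/stages",
    params={"taskstatus": "KILLED"},
)
resp.raise_for_status()

# Count the killed tasks per stage to gauge the cost of speculation.
for stage in resp.json():
    tasks = stage.get("tasks", {})
    killed = [t for t in tasks.values() if t.get("status") == "KILLED"]
    print(stage.get("stageId"), stage.get("attemptId"), len(killed))
{code}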

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_failedTasks.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-26399:
---
Attachment: lispark230_restapi_ex2_stages_failedTasks.json

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_failedTasks.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260854#comment-17260854
 ] 

Ron Hu commented on SPARK-26399:


[^lispark230_restapi_ex2_stages_failedTasks.json]

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_failedTasks.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33741) Add minimum threshold speculation config

2021-01-07 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260852#comment-17260852
 ] 

Thomas Graves commented on SPARK-33741:
---

probably dup of https://issues.apache.org/jira/browse/SPARK-29910

> Add minimum threshold speculation config
> 
>
> Key: SPARK-33741
> URL: https://issues.apache.org/jira/browse/SPARK-33741
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> When we turn on speculation with default configs, the last 10% of the 
> tasks are subject to speculation. There are a lot of stages that run for only 
> a few seconds to minutes. Also, in general we don't want to speculate tasks 
> that run within a specific interval. Setting a minimum threshold for 
> speculation gives us better control.
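
For context, a sketch of the speculation settings that exist today (values are 
illustrative only; the minimum-duration threshold proposed in this ticket would be an 
additional, separate config):

{code:java}
from pyspark.sql import SparkSession

# Existing speculation knobs, shown with example values.
spark = (SparkSession.builder
         .config("spark.speculation", "true")
         .config("spark.speculation.quantile", "0.9")    # speculate among the last 10% of tasks
         .config("spark.speculation.multiplier", "1.5")
         .getOrCreate())
{code}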



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260805#comment-17260805
 ] 

Ron Hu edited comment on SPARK-26399 at 1/7/21, 10:23 PM:
--

Dr. Elephant ([https://github.com/linkedin/dr-elephant]) is a downstream open 
source product that utilizes Spark monitoring information so that it can advise 
Spark users where to optimize their configuration parameters, ranging from 
memory usage to the number of cores.  Because the initial description of this 
ticket is too brief to be clear, let me explain the use cases for Dr. Elephant 
here. 

REST API /applications/[app-id]/stages: This useful endpoint generates a json 
file containing all stages for a given application.  The current Spark version 
already provides it.

In order to debug whether there is a skew issue, a downstream product also needs:
 - taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

 - executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoint 
file.  It should be fine because the current stages json file is not that big.  
Here is one sample json file for the stages endpoint. 
[^lispark230_restapi_ex2_stages_withSummaries.json]

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries".  But it may need a little bit 
more code for a new endpoint.
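
For illustration only, here is a small sketch of the kind of quantile summary described 
above (the metric values are made up; numpy is just used to compute the (0.0, 0.25, 0.5, 
0.75, 1.0) distribution that a taskMetricsSummary or executorMetricsSummary entry would 
report):

{code:java}
import numpy as np

# Made-up executorRunTime values (ms) for the tasks of one stage.
executor_run_times = [1200, 1350, 980, 15000, 1100, 1250]

quantiles = [0.0, 0.25, 0.5, 0.75, 1.0]
summary = {q: float(np.quantile(executor_run_times, q)) for q in quantiles}

# A max far above the median is the kind of skew signal a tool like
# Dr. Elephant would look for in these summaries.
print(summary)
{code}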


was (Author: ron8hu):
Dr. Elephant ([https://github.com/linkedin/dr-elephant]) is a downstream open 
source product that utilizes Spark monitoring information so that it can advise 
Spark users where to optimize their configuration parameters ranging from 
memory usage, number of cores, etc.  Because the initial description of this 
ticket is too brief to be clear.  Let me explain the use cases for Dr. Elephant 
here. 

REST API /applications/[app-id]/stages: This useful endpoint provides a list of 
all stages for a given application.  The current Spark version already provides 
it.

In order to debug if there exists a skew issue, a downstream product also needs:
 - taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

 - executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoints 
file.  It should be fine because the current stages json file is not that big.  
Here is one sample json file for stages endpoint. 
[^lispark230_restapi_ex2_stages_withSummaries.json]

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries".  But it may need a little bit 
more code for a new endpoint.

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--

[jira] [Comment Edited] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260805#comment-17260805
 ] 

Ron Hu edited comment on SPARK-26399 at 1/7/21, 8:41 PM:
-

Dr. Elephant ([https://github.com/linkedin/dr-elephant]) is a downstream open 
source product that utilizes Spark monitoring information so that it can advise 
Spark users where to optimize their configuration parameters ranging from 
memory usage, number of cores, etc.  Because the initial description of this 
ticket is too brief to be clear.  Let me explain the use cases for Dr. Elephant 
here. 

REST API /applications/[app-id]/stages: This useful endpoint provides a list of 
all stages for a given application.  The current Spark version already provides 
it.

In order to debug if there exists a skew issue, a downstream product also needs:
 - taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

 - executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoints 
file.  It should be fine because the current stages json file is not that big.  
Here is one sample json file for stages endpoint. 
[^lispark230_restapi_ex2_stages_withSummaries.json]

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries".  But it may need a little bit 
more code for a new endpoint.


was (Author: ron8hu):
Dr. Elephant ([https://github.com/linkedin/dr-elephant]) is a downstream open 
source product that utilizes Spark monitoring information so that it can advise 
Spark users where to optimize their configuration parameters ranging from 
memory usage, number of cores, etc.  Because the initial description of this 
ticket is too brief to be clear.  Let me explain the use cases for Dr. Elephant 
here. 

REST API /applications/[app-id]/stages: This useful endpoint provides a list of 
all stages for a given application.  The current Spark version already provides 
it.

In order to debug if there exists a skew issue, a downstream product also needs:
 - taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

 - executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoints 
file.  It should be fine because the current stages json file is not that big.  
Here is one sample json file "lispark230_restapi_ex2_stages_withSummaries.json" 
for stages endpoint. [^lispark230_restapi_ex2_stages_withSummaries.json]

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries".  But it may need a little bit 
more code for a new endpoint.

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the 

[jira] [Comment Edited] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260805#comment-17260805
 ] 

Ron Hu edited comment on SPARK-26399 at 1/7/21, 8:40 PM:
-

Dr. Elephant ([https://github.com/linkedin/dr-elephant]) is a downstream open 
source product that utilizes Spark monitoring information so that it can advise 
Spark users where to optimize their configuration parameters ranging from 
memory usage, number of cores, etc.  Because the initial description of this 
ticket is too brief to be clear.  Let me explain the use cases for Dr. Elephant 
here. 

REST API /applications/[app-id]/stages: This useful endpoint provides a list of 
all stages for a given application.  The current Spark version already provides 
it.

In order to debug if there exists a skew issue, a downstream product also needs:
 - taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

 - executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoints 
file.  It should be fine because the current stages json file is not that big.  
Here is one sample json file "lispark230_restapi_ex2_stages_withSummaries.json" 
for stages endpoint. [^lispark230_restapi_ex2_stages_withSummaries.json]

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries".  But it may need a little bit 
more code for a new endpoint.


was (Author: ron8hu):
Dr. Elephant (https://github.com/linkedin/dr-elephant) is a downstream open 
source product that utilizes Spark monitoring information so that it can advise 
Spark users where to optimize their configuration parameters ranging from 
memory usage, number of cores, etc.  Because the initial description of this 
ticket is too brief to be clear.  Let me explain the use cases for Dr. Elephant 
here. 

REST API /applications/[app-id]/stages: This useful endpoint provides a list of 
all stages for a given application.  The current Spark version already provides 
it.

In order to debug if there exists a skew issue, a downstream product also needs:

- taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

- executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoints 
file.  It should be fine because the current stages json file is not that big.  
Here is one sample json file "lispark230_restapi_ex2_stages_withSummaries.json" 
for stages endpoint. 

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries".  But it may need a little bit 
more code for a new endpoint.

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, add adding the task summary if specified
>  * executor metric quantiles, and adding the 

[jira] [Updated] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ron Hu updated SPARK-26399:
---
Attachment: lispark230_restapi_ex2_stages_withSummaries.json

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> lispark230_restapi_ex2_stages_withSummaries.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, and adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260808#comment-17260808
 ] 

Ron Hu commented on SPARK-26399:


[^lispark230_restapi_ex2_stages_withSummaries.json]

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, and adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26399) Add new stage-level REST APIs and parameters

2021-01-07 Thread Ron Hu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260805#comment-17260805
 ] 

Ron Hu commented on SPARK-26399:


Dr. Elephant (https://github.com/linkedin/dr-elephant) is a downstream open 
source product that uses Spark monitoring information to advise Spark users on 
how to tune their configuration parameters, such as memory usage, number of 
cores, etc.  Because the initial description of this ticket is too brief to be 
clear, let me explain the use cases for Dr. Elephant here. 

REST API /applications/[app-id]/stages: This useful endpoint provides a list of 
all stages for a given application.  The current Spark version already provides 
it.

In order to debug if there exists a skew issue, a downstream product also needs:

- taskMetricsSummary: It includes task metric information such as 
executorRunTime, inputMetrics, outputMetrics,   shuffleReadMetrics, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the tasks in a 
given stage.  The same information shows up in Web UI for a specified stage.

- executorMetricsSummary: It includes executor metrics information such as 
number of tasks, input bytes, peak JVM memory, peak execution memory, etc.  All 
in quantile distribution (0.0, 0.25, 0.5, 0.75, 1.0) for all the executors used 
in a given stage.  This information has been developed by [~angerszhuuu] in the 
PR he submitted.

We can add the above information to the json file generated by 
/applications/[app-id]/stages. It may double the size of the stages endpoint's 
output, but that should be fine because the current stages json file is not 
that big.  Here is one sample json file for the stages endpoint: 
"lispark230_restapi_ex2_stages_withSummaries.json". 

An alternative approach is to add a new REST API such as 
"/applications/[app-id]/stages/withSummaries", but that would need a little 
more code for a new endpoint.
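For illustration only, here is a minimal Scala sketch of how a downstream tool 
such as Dr. Elephant could consume the existing stages endpoint; the History 
Server address and application id are placeholder assumptions, and the 
taskMetricsSummary / executorMetricsSummary fields mentioned in the comments 
are the proposed additions, not part of the current payload.

{code:java}
import scala.io.Source

// Placeholder assumptions: adjust to your Spark History Server and application id.
val historyServer = "http://localhost:18080"
val appId = "application_1610000000000_0001"

// Existing endpoint: all stages for the given application.
val stagesJson = Source.fromURL(s"$historyServer/api/v1/applications/$appId/stages").mkString

// Under this proposal the same payload would also carry per-stage
// "taskMetricsSummary" and "executorMetricsSummary" objects with quantile
// distributions (0.0, 0.25, 0.5, 0.75, 1.0), so a client such as Dr. Elephant
// would only need to read extra fields from the response it already fetches.
println(stagesJson.take(500))
{code}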

> Add new stage-level REST APIs and parameters
> 
>
> Key: SPARK-26399
> URL: https://issues.apache.org/jira/browse/SPARK-26399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Edward Lu
>Priority: Major
> Attachments: executorMetricsSummary.json, 
> stage_executorSummary_image1.png
>
>
> Add the peak values for the metrics to the stages REST API. Also add a new 
> executorSummary REST API, which will return executor summary metrics for a 
> specified stage:
> {code:java}
> curl http:// server>:18080/api/v1/applicationsexecutorMetricsSummary{code}
> Add parameters to the stages REST API to specify:
>  * filtering for task status, and returning tasks that match (for example, 
> FAILED tasks).
>  * task metric quantiles, and adding the task summary if specified
>  * executor metric quantiles, and adding the executor summary if specified



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33661) Unable to load RandomForestClassificationModel trained in Spark 2.x

2021-01-07 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260789#comment-17260789
 ] 

Sean R. Owen commented on SPARK-33661:
--

BTW I think this was actually fixed in 
https://issues.apache.org/jira/browse/SPARK-33398

> Unable to load RandomForestClassificationModel trained in Spark 2.x
> ---
>
> Key: SPARK-33661
> URL: https://issues.apache.org/jira/browse/SPARK-33661
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.1
>Reporter: Marcus Levine
>Priority: Major
>
> When attempting to load a RandomForestClassificationModel that was trained in 
> Spark 2.x using Spark 3.x, an exception is raised:
> {code:python}
> ...
> RandomForestClassificationModel.load('/path/to/my/model')
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 330, in 
> load
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 291, 
> in load
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 280, in 
> load
>   File "/usr/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in 
> deco
>   File "", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: No such struct field rawCount in id, 
> prediction, impurity, impurityStats, gain, leftChild, rightChild, split;
> {code}
> There seems to be a schema incompatibility between the trained model data 
> saved by Spark 2.x and the expected data for a model trained in Spark 3.x
> If this issue is not resolved, users will be forced to retrain any existing 
> random forest models they trained in Spark 2.x using Spark 3.x before they 
> can upgrade



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-33661) Unable to load RandomForestClassificationModel trained in Spark 2.x

2021-01-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reopened SPARK-33661:
--

> Unable to load RandomForestClassificationModel trained in Spark 2.x
> ---
>
> Key: SPARK-33661
> URL: https://issues.apache.org/jira/browse/SPARK-33661
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.1
>Reporter: Marcus Levine
>Priority: Major
>
> When attempting to load a RandomForestClassificationModel that was trained in 
> Spark 2.x using Spark 3.x, an exception is raised:
> {code:python}
> ...
> RandomForestClassificationModel.load('/path/to/my/model')
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 330, in 
> load
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 291, 
> in load
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 280, in 
> load
>   File "/usr/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in 
> deco
>   File "", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: No such struct field rawCount in id, 
> prediction, impurity, impurityStats, gain, leftChild, rightChild, split;
> {code}
> There seems to be a schema incompatibility between the trained model data 
> saved by Spark 2.x and the expected data for a model trained in Spark 3.x
> If this issue is not resolved, users will be forced to retrain any existing 
> random forest models they trained in Spark 2.x using Spark 3.x before they 
> can upgrade



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33661) Unable to load RandomForestClassificationModel trained in Spark 2.x

2021-01-07 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-33661.
--
Resolution: Duplicate

> Unable to load RandomForestClassificationModel trained in Spark 2.x
> ---
>
> Key: SPARK-33661
> URL: https://issues.apache.org/jira/browse/SPARK-33661
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.1
>Reporter: Marcus Levine
>Priority: Major
>
> When attempting to load a RandomForestClassificationModel that was trained in 
> Spark 2.x using Spark 3.x, an exception is raised:
> {code:python}
> ...
> RandomForestClassificationModel.load('/path/to/my/model')
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 330, in 
> load
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 291, 
> in load
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/ml/util.py", line 280, in 
> load
>   File "/usr/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 
> 1305, in __call__
>   File "/usr/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 134, in 
> deco
>   File "", line 3, in raise_from
> pyspark.sql.utils.AnalysisException: No such struct field rawCount in id, 
> prediction, impurity, impurityStats, gain, leftChild, rightChild, split;
> {code}
> There seems to be a schema incompatibility between the trained model data 
> saved by Spark 2.x and the expected data for a model trained in Spark 3.x
> If this issue is not resolved, users will be forced to retrain any existing 
> random forest models they trained in Spark 2.x using Spark 3.x before they 
> can upgrade



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34043) Linear Discriminant Analysis for dimensionality reduction

2021-01-07 Thread Jose Llorens (Jira)
Jose Llorens created SPARK-34043:


 Summary: Linear Discriminant Analysis for dimensionality reduction
 Key: SPARK-34043
 URL: https://issues.apache.org/jira/browse/SPARK-34043
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.0.1
Reporter: Jose Llorens


The idea is to implement Linear Discriminant Analysis (LDA) for dimensionality 
reduction. The algorithm is similar to PCA, but it uses the class labels 
(supervision) to maximize class separation.  The API would be similar to the 
PCA one.

 

Other frameworks implement LDA, like sklearn 
([https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html])

 

LDA is a well-known algorithm with related literature, for example:

Izenman, Alan Julian. "Linear discriminant analysis." _Modern multivariate 
statistical techniques_. Springer, New York, NY, 2013. 237-280.

 

I would like to work on this issue; please let me know whether it would be 
interesting to add this feature to Spark.

 

Thank you in advance.
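
For context, here is a minimal Scala sketch contrasting the existing spark.ml 
PCA API with the kind of API this proposal suggests; the 
LinearDiscriminantAnalysis class, its setLabelCol parameter, and the output 
column name are hypothetical and shown only as comments.

{code:java}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("lda-api-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  (0.0, Vectors.dense(2.0, 3.0, 5.0)),
  (0.0, Vectors.dense(2.1, 2.9, 5.2)),
  (1.0, Vectors.dense(4.0, 1.0, 7.0)),
  (1.0, Vectors.dense(4.2, 1.1, 6.8))
).toDF("label", "features")

// Existing API: PCA projects to k dimensions without using the labels.
val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(2)
  .fit(df)
pcaModel.transform(df).show(false)

// Hypothetical analogous estimator for supervised dimensionality reduction
// (class name and setLabelCol are assumptions, not an existing Spark API):
// val ldaModel = new LinearDiscriminantAnalysis()
//   .setInputCol("features")
//   .setLabelCol("label")
//   .setOutputCol("ldaFeatures")
//   .setK(1)   // at most numClasses - 1 discriminant directions
//   .fit(df)
{code}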



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33995) Make datetime addition easier for years, weeks, hours, minutes, and seconds

2021-01-07 Thread Matthew Powers (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Powers updated SPARK-33995:
---
Affects Version/s: (was: 3.1.0)
   3.2.0

> Make datetime addition easier for years, weeks, hours, minutes, and seconds
> ---
>
> Key: SPARK-33995
> URL: https://issues.apache.org/jira/browse/SPARK-33995
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Matthew Powers
>Priority: Minor
>
> There are add_months and date_add functions that make it easy to perform 
> datetime addition with months and days, but there isn't an easy way to 
> perform datetime addition with years, weeks, hours, minutes, or seconds with 
> the Scala/Python/R APIs.
> Users need to write code like expr("first_datetime + INTERVAL 2 hours") to 
> add two hours to a timestamp with the Scala API, which isn't desirable.  We 
> don't want to make Scala users manipulate SQL strings.
> We can expose the [make_interval SQL 
> function|https://github.com/apache/spark/pull/26446/files] to make any 
> combination of datetime addition possible.  That'll make tons of different 
> datetime addition operations possible and will be valuable for a wide array 
> of users.
> make_interval takes 7 arguments: years, months, weeks, days, hours, mins, and 
> secs.
> There are different ways to expose the make_interval functionality to 
> Scala/Python/R users:
>  * Option 1: Single make_interval function that takes 7 arguments
>  * Option 2: expose a few interval functions
>  ** make_date_interval function that takes years, months, days
>  ** make_time_interval function that takes hours, minutes, seconds
>  ** make_datetime_interval function that takes years, months, days, hours, 
> minutes, seconds
>  * Option 3: expose add_years, add_months, add_days, add_weeks, add_hours, 
> add_minutes, and add_seconds as Column methods.  
>  * Option 4: Expose the add_years, add_hours, etc. as column functions.  
> add_months and date_add have already been exposed in this manner.  
> Option 1 is nice from a maintenance perspective because it's a single function, 
> but it's not standard from a user perspective.  Most languages support 
> datetime instantiation with these arguments: years, months, days, hours, 
> minutes, seconds.  Mixing weeks into the equation is not standard.
> As a user, Option 3 would be my preference.  
> col("first_datetime").addHours(2).addSeconds(30) is easy for me to remember 
> and type.  col("first_datetime") + make_time_interval(lit(2), lit(0), 
> lit(30)) isn't as nice.  col("first_datetime") + make_interval(lit(0), 
> lit(0), lit(0), lit(0), lit(2), lit(0), lit(30)) is harder still.
> Any of these options is an improvement to the status quo.  Let me know what 
> option you think is best and then I'll make a PR to implement it, building 
> off of Max's foundational work of course ;)
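
As a small illustration of the status quo versus the proposal, here is a hedged 
Scala sketch; only the expr/INTERVAL form works in current Spark, and the 
commented-out addHours/addSeconds calls are the hypothetical Option 3 methods 
discussed above (the first_datetime column name comes from the example in the 
description).

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[1]").appName("interval-sketch").getOrCreate()
import spark.implicits._

val df = Seq("2021-01-07 10:00:00").toDF("s")
  .selectExpr("CAST(s AS TIMESTAMP) AS first_datetime")

// Today: adding hours (or seconds, minutes, ...) means falling back to a SQL string.
df.withColumn("two_hours_later", expr("first_datetime + INTERVAL 2 hours")).show(false)

// Proposed Option 3 (hypothetical Column methods, not in current Spark):
// df.withColumn("two_hours_later", col("first_datetime").addHours(2).addSeconds(30))
{code}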



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260562#comment-17260562
 ] 

Apache Spark commented on SPARK-33933:
--

User 'zhongyu09' has created a pull request for this issue:
https://github.com/apache/spark/pull/31084

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Yu Zhong
>Priority: Major
> Fix For: 3.1.1
>
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as shown below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) runs after 
> a long-running shuffle (more than 5 minutes).  Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, 
> in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates 
> query stages for the materialized parts via createQueryStages, and 
> materializes those newly created query stages to submit map stages or 
> broadcasts. When a ShuffleQueryStage is materializing before a 
> BroadcastQueryStage, the map job and the broadcast job are submitted almost at 
> the same time, but the map job holds all the computing resources. If the map 
> job runs slowly (when there is a lot of data to process and resources are 
> limited), the broadcast job cannot be started (and finished) before 
> spark.sql.broadcastTimeout expires, which causes the whole job to fail 
> (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  
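
For completeness, a minimal sketch of the two mitigations mentioned above; both 
configuration keys exist today, and the 1200-second value is only an example.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("Test Broadcast").getOrCreate()

// Mitigation 1: give the broadcast job more time to start after the long map stage.
spark.conf.set("spark.sql.broadcastTimeout", "1200")   // seconds; example value

// Mitigation 2: disable AQE for the affected query, avoiding the stage-ordering issue.
spark.conf.set("spark.sql.adaptive.enabled", "false")
{code}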



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2021-01-07 Thread Vinoo Ganesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260558#comment-17260558
 ] 

Vinoo Ganesh commented on SPARK-32165:
--

[~tgraves] - I just added in a description from the comment into the jira

> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
> I'd like to point out that this pr 
> (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak 
> completely. Once {{SessionState}} is touched, it will add two more listeners 
> into the SparkContext, namely {{SQLAppStatusListener}} and 
> {{ExecutionListenerBus}}
> It can be reproduced easily as
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2021-01-07 Thread Vinoo Ganesh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoo Ganesh updated SPARK-32165:
-
Description: 
Copied from [https://github.com/apache/spark/pull/28128#issuecomment-653102770]

I'd like to point out that this pr (https://github.com/apache/spark/pull/28128) 
doesn't fix the memory leak completely. Once {{SessionState}} is touched, it 
will add two more listeners into the SparkContext, namely 
{{SQLAppStatusListener}} and {{ExecutionListenerBus}}

It can be reproduced easily as
{code:java}
  test("SPARK-31354: SparkContext only register one SparkSession ApplicationEnd 
listener") {
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test-app-SPARK-31354-1")
val context = new SparkContext(conf)
SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postFirstCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postSecondCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
assert(postFirstCreation == postSecondCreation)
  }
{code}
The problem can be reproduced by the above code.

  was:
Copied from [https://github.com/apache/spark/pull/28128#issuecomment-653102770]

 I'd like to point out that this pr doesn't fix the memory leaky completely. 
Once {{SessionState}} is touched, it will add two more listeners into the 
SparkContext, namely {{SQLAppStatusListener}} and {{ExecutionListenerBus}}

It can be reproduced easily as
{code:java}
  test("SPARK-31354: SparkContext only register one SparkSession ApplicationEnd 
listener") {
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test-app-SPARK-31354-1")
val context = new SparkContext(conf)
SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postFirstCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postSecondCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
assert(postFirstCreation == postSecondCreation)
  }
{code}
The problem can be reproduced by the above code.


> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
> I'd like to point out that this pr 
> (https://github.com/apache/spark/pull/28128) doesn't fix the memory leak 
> completely. Once {{SessionState}} is touched, it will add two more listeners 
> into the SparkContext, namely {{SQLAppStatusListener}} and 
> {{ExecutionListenerBus}}
> It can be reproduced easily as
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32165) SessionState leaks SparkListener with multiple SparkSession

2021-01-07 Thread Vinoo Ganesh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoo Ganesh updated SPARK-32165:
-
Description: 
Copied from [https://github.com/apache/spark/pull/28128#issuecomment-653102770]

 I'd like to point out that this pr doesn't fix the memory leak completely. 
Once {{SessionState}} is touched, it will add two more listeners into the 
SparkContext, namely {{SQLAppStatusListener}} and {{ExecutionListenerBus}}

It can be reproduced easily as
{code:java}
  test("SPARK-31354: SparkContext only register one SparkSession ApplicationEnd 
listener") {
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test-app-SPARK-31354-1")
val context = new SparkContext(conf)
SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postFirstCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postSecondCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
assert(postFirstCreation == postSecondCreation)
  }
{code}
The problem can be reproduced by the above code.

  was:
Copied from [https://github.com/apache/spark/pull/28128#issuecomment-653102770]

 
{code:java}
  test("SPARK-31354: SparkContext only register one SparkSession ApplicationEnd 
listener") {
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("test-app-SPARK-31354-1")
val context = new SparkContext(conf)
SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postFirstCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()

SparkSession
  .builder()
  .sparkContext(context)
  .master("local")
  .getOrCreate()
  .sessionState // this touches the sessionState
val postSecondCreation = context.listenerBus.listeners.size()
SparkSession.clearActiveSession()
SparkSession.clearDefaultSession()
assert(postFirstCreation == postSecondCreation)
  }
{code}
The problem can be reproduced by the above code.


> SessionState leaks SparkListener with multiple SparkSession
> ---
>
> Key: SPARK-32165
> URL: https://issues.apache.org/jira/browse/SPARK-32165
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianjin YE
>Priority: Major
>
> Copied from 
> [https://github.com/apache/spark/pull/28128#issuecomment-653102770]
>  I'd like to point out that this pr doesn't fix the memory leak completely. 
> Once {{SessionState}} is touched, it will add two more listeners into the 
> SparkContext, namely {{SQLAppStatusListener}} and {{ExecutionListenerBus}}
> It can be reproduced easily as
> {code:java}
>   test("SPARK-31354: SparkContext only register one SparkSession 
> ApplicationEnd listener") {
> val conf = new SparkConf()
>   .setMaster("local")
>   .setAppName("test-app-SPARK-31354-1")
> val context = new SparkContext(conf)
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postFirstCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> SparkSession
>   .builder()
>   .sparkContext(context)
>   .master("local")
>   .getOrCreate()
>   .sessionState // this touches the sessionState
> val postSecondCreation = context.listenerBus.listeners.size()
> SparkSession.clearActiveSession()
> SparkSession.clearDefaultSession()
> assert(postFirstCreation == postSecondCreation)
>   }
> {code}
> The problem can be reproduced by the above code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34026) DataSource V2: Inject repartition and sort nodes to satisfy required distribution and ordering

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34026:


Assignee: Apache Spark

> DataSource V2: Inject repartition and sort nodes to satisfy required 
> distribution and ordering
> --
>
> Key: SPARK-34026
> URL: https://issues.apache.org/jira/browse/SPARK-34026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Assignee: Apache Spark
>Priority: Major
>
> As the next step towards requesting distribution and ordering on write, we 
> should inject appropriate repartition and sort nodes.
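
For intuition only, here is a hedged DataFrame-level sketch of the effect the 
injected nodes would have: today a user can repartition and sort by hand before 
a v2 write, and the change would insert the equivalent logical nodes 
automatically based on the distribution and ordering the table requests. The 
catalog and table names are placeholders and assume a configured v2 catalog.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[2]").appName("dsv2-write-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "2021-01-07"), (2, "2021-01-08"), (3, "2021-01-07")).toDF("id", "day")

// Hand-written equivalent of what the rule would inject for a table that asks
// for clustering by "day" and ordering by "id" within each partition.
df.repartition(col("day"))
  .sortWithinPartitions(col("id"))
  .writeTo("my_catalog.db.events")   // placeholder table in a configured v2 catalog
  .append()
{code}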



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34026) DataSource V2: Inject repartition and sort nodes to satisfy required distribution and ordering

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260399#comment-17260399
 ] 

Apache Spark commented on SPARK-34026:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/31083

> DataSource V2: Inject repartition and sort nodes to satisfy required 
> distribution and ordering
> --
>
> Key: SPARK-34026
> URL: https://issues.apache.org/jira/browse/SPARK-34026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> As the next step towards requesting distribution and ordering on write, we 
> should inject appropriate repartition and sort nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34026) DataSource V2: Inject repartition and sort nodes to satisfy required distribution and ordering

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260398#comment-17260398
 ] 

Apache Spark commented on SPARK-34026:
--

User 'aokolnychyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/31083

> DataSource V2: Inject repartition and sort nodes to satisfy required 
> distribution and ordering
> --
>
> Key: SPARK-34026
> URL: https://issues.apache.org/jira/browse/SPARK-34026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> As the next step towards requesting distribution and ordering on write, we 
> should inject appropriate repartition and sort nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34026) DataSource V2: Inject repartition and sort nodes to satisfy required distribution and ordering

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34026:


Assignee: (was: Apache Spark)

> DataSource V2: Inject repartition and sort nodes to satisfy required 
> distribution and ordering
> --
>
> Key: SPARK-34026
> URL: https://issues.apache.org/jira/browse/SPARK-34026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> As the next step towards requesting distribution and ordering on write, we 
> should inject appropriate repartition and sort nodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33981) SparkUI: Storage page is empty even if cached

2021-01-07 Thread Maple (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maple resolved SPARK-33981.
---
Resolution: Fixed

> SparkUI: Storage page is empty even if cached
> -
>
> Key: SPARK-33981
> URL: https://issues.apache.org/jira/browse/SPARK-33981
> Project: Spark
>  Issue Type: Question
>  Components: Web UI
>Affects Versions: 2.3.3
> Environment: spark 2.3.3
>Reporter: Maple
>Priority: Major
> Attachments: ba5a987152c6270f34b968bd89ca36a.png
>
>
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val rdd = sc.parallelize(1 to 10, 
> 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize 
> at <console>:25
> scala> rdd.count
> res0: Long = 10
> but SparkUI storage page is empty
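
As a hedged cross-check from the same spark-shell session, independent of the 
Web UI (sc.getPersistentRDDs and the storage REST endpoint both exist in Spark 
2.3.x; the host and app id in the curl line are placeholders):

{code:java}
// Lists RDDs that have been marked as persistent, keyed by RDD id.
sc.getPersistentRDDs.foreach { case (id, r) =>
  println(s"rdd $id -> name=${r.name}, storageLevel=${r.getStorageLevel}")
}

// The REST view backing the Storage tab (placeholders for host and app id):
//   curl http://<driver-host>:4040/api/v1/applications/<app-id>/storage/rdd
{code}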



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34041:
-
Description: 
1. Add a link of quick start in PySpark docs into "Programming Guides" in Spark 
main docs
2. ML MLlib -> MLlib (DataFrame-based)" and "MLlib (RDD-based)"
3. Mention MLlib user guide 
(https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)
4. Mention other migration guides as well because PySpark can get affected by 
it.


  was:
1. Add a link of quick start in PySpark docs into "Programming Guides" in Spark 
main docs
2. ML MLlib -> MLlib (DataFrame-based)" and "MLlib (RDD-based)"
3. Mention MLlib user guide 
(https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)




> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)
> 4. Mention other migration guides as well because PySpark can get affected by 
> it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34042) Column pruning is not working as expected for PERMISIVE mode

2021-01-07 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-34042:
-
Description: 
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

P.S. I don't know exactly what the right behavior is; I didn't find it 
documented for PERMISSIVE mode.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 

  was:
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

PS. I don't know exactly what is the right behavior, I didn't find for 
PERMISSIVE mode the documentation.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 


> Column pruning is not working as expected for PERMISIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode
> Given a csv with multiple columns per row, if your file schema has a single 
> column and you are doing a SELECT in SQL with a condition like 
> ' is null', the row is marked as corrupted
>  
> BUT if you add an extra column in the file schema and you are not putting 
> that column in SQL SELECT , the row is not marked as corrupted
>  
> P.S. I don't know exactly what the right behavior is; I didn't find it 
> documented for PERMISSIVE mode.
> What I found is: As an example, CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  
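
For readers who want a quick repro without checking out the linked project, 
here is a minimal Scala sketch of the query pattern described above; the file 
path is a placeholder, the schema and column names (id, _corrupt_record) are 
assumptions modeled on the description, and the columnPruning key is the 
existing SQL config mentioned in the migration guide note.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().master("local[1]").appName("permissive-sketch").getOrCreate()

// Existing config from the migration guide note above; setting it to false
// restores the pre-2.4 behavior:
// spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

// File schema with a single data column plus the corrupt-record column.
val schema = new StructType()
  .add("id", StringType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("/path/to/rows_with_extra_columns.csv")   // placeholder path

df.createOrReplaceTempView("t")
// The predicate from the description: keep only rows not flagged as corrupted.
spark.sql("SELECT id FROM t WHERE _corrupt_record IS NULL").show(false)
{code}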



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34042) Column pruning is not working as expected for PERMISIVE mode

2021-01-07 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-34042:
-
Description: 
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

P.S. I don't know exactly what the right behavior is; I didn't find it 
documented for PERMISSIVE mode.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 

  was:
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

PS. I don't know exactly what is the right behavior, I didn't find for 
PERMISSIVE mode the documentation.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 


> Column pruning is not working as expected for PERMISIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode
> Given a csv with multiple columns per row, if your file schema has a single 
> column and you are doing a SELECT in SQL with a condition like 
> ' is null', the row is marked as corrupted
>  
> BUT if you add an extra column in the file schema and you are not putting 
> that column in SQL SELECT , the row is not marked as corrupted
>  
> P.S. I don't know exactly what the right behavior is; I didn't find it 
> documented for PERMISSIVE mode.
> What I found is: As an example, CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34042) Column pruning is not working as expected for PERMISIVE mode

2021-01-07 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-34042:
-
Description: 
In PERMISSIVE mode

Given a csv with multiple columns per row, if your file schema has a single 
column and you are doing a SELECT in SQL with a condition like 
' is null', the row is marked as corrupted

 

BUT if you add an extra column in the file schema and you are not putting that 
column in SQL SELECT , the row is not marked as corrupted

 

P.S. I don't know exactly what the right behavior is; I didn't find it 
documented for PERMISSIVE mode.

What I found is: As an example, CSV file contains the "id,name" header and one 
row "1234". In Spark 2.4, the selection of the id column consists of a row with 
one column value 1234 but in Spark 2.3 and earlier, it is empty in the 
DROPMALFORMED mode. To restore the previous behavior, set 
{{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 

  was:
In PERMISSIVE mode

Given a csv with multiple columns, if you have in schema a single column and 
you are selecting in SQL with condition that corrupt record to be null, the row 
is mapped as corrupted.

BUT if you add an extra column in csv schema an extra column and you are not 
selecting that column in SQL, the row is not corrupted

 

PS. I don't know exactly what is the right behavior, I didn't find for 
PERMISSIVE mode the documentation. What I found is: As an example, CSV file 
contains the "id,name" header and one row "1234". In Spark 2.4, the selection 
of the id column consists of a row with one column value 1234 but in Spark 2.3 
and earlier, it is empty in the DROPMALFORMED mode. To restore the previous 
behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 


> Column pruning is not working as expected for PERMISIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode
> Given a csv with multiple columns per row, if your file schema has a single 
> column and you are doing a SELECT in SQL with a condition like 
> ' is null', the row is marked as corrupted
>  
> BUT if you add an extra column in the file schema and you are not putting 
> that column in SQL SELECT , the row is not marked as corrupted
>  
> P.S. I don't know exactly what the right behavior is; I didn't find it 
> documented for PERMISSIVE mode.
> What I found is: As an example, CSV file contains the "id,name" header and 
> one row "1234". In Spark 2.4, the selection of the id column consists of a 
> row with one column value 1234 but in Spark 2.3 and earlier, it is empty in 
> the DROPMALFORMED mode. To restore the previous behavior, set 
> {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33933:
---

Assignee: Yu Zhong

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Yu Zhong
>Priority: Major
> Fix For: 3.1.1
>
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as shown below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) runs after 
> a long-running shuffle (more than 5 minutes).  Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, 
> in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates 
> query stages for the materialized parts via createQueryStages, and 
> materializes those newly created query stages to submit map stages or 
> broadcasts. When a ShuffleQueryStage is materializing before a 
> BroadcastQueryStage, the map job and the broadcast job are submitted almost at 
> the same time, but the map job holds all the computing resources. If the map 
> job runs slowly (when there is a lot of data to process and resources are 
> limited), the broadcast job cannot be started (and finished) before 
> spark.sql.broadcastTimeout expires, which causes the whole job to fail 
> (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2021-01-07 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33933.
-
Fix Version/s: 3.1.1
   Resolution: Fixed

Issue resolved by pull request 30998
[https://github.com/apache/spark/pull/30998]

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Priority: Major
> Fix For: 3.1.1
>
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as shown below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) runs after 
> a long-running shuffle (more than 5 minutes).  Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, 
> in getFinalPhysicalPlan, Spark traverses the physical plan bottom-up, creates 
> query stages for the materialized parts via createQueryStages, and 
> materializes those newly created query stages to submit map stages or 
> broadcasts. When a ShuffleQueryStage is materializing before a 
> BroadcastQueryStage, the map job and the broadcast job are submitted almost at 
> the same time, but the map job holds all the computing resources. If the map 
> job runs slowly (when there is a lot of data to process and resources are 
> limited), the broadcast job cannot be started (and finished) before 
> spark.sql.broadcastTimeout expires, which causes the whole job to fail 
> (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260327#comment-17260327
 ] 

Apache Spark commented on SPARK-34041:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31082

> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34041:


Assignee: Apache Spark

> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260325#comment-17260325
 ] 

Apache Spark commented on SPARK-34041:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31082

> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34041:


Assignee: (was: Apache Spark)

> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34042) Column pruning is not working as expected for PERMISSIVE mode

2021-01-07 Thread Marius Butan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marius Butan updated SPARK-34042:
-
Description: 
In PERMISSIVE mode:

Given a CSV with multiple columns, if the schema contains only a single column 
and the SQL query filters on the corrupt-record column being null, the row is 
mapped as corrupted.

BUT if you add an extra column to the CSV schema and do not select that column 
in SQL, the row is not marked as corrupted.

 

PS. I don't know exactly what the right behavior is; I could not find 
documentation for PERMISSIVE mode. What I did find is: As an example, a CSV 
file contains the "id,name" header and one row "1234". In Spark 2.4, the 
selection of the id column consists of a row with one column value 1234, but 
in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the 
previous behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to 
{{false}}.

 

[https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 

  was:
In PERMISSIVE mode

Given a csv with multiple columns, if you have in schema a single column and 
you are selecting in SQL with condition that corrupt record to be null, the row 
is mapped as corrupted.

BUT if you add an extra column in csv schema an extra column and you are not 
select that column in SQL, the row is not corrupted

 

PS. I don't know exactly what is the right behaviour, I didn't find for 
PERMISSIVE mode the documentation. What I found is: As an example, CSV file 
contains the "id,name" header and one row "1234". In Spark 2.4, selection of 
the id column consists of a row with one column value 1234 but in Spark 2.3 and 
earlier it is empty in the DROPMALFORMED mode. To restore the previous 
behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to {{false}}.

 

https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 


> Column pruning is not working as expected for PERMISSIVE mode
> 
>
> Key: SPARK-34042
> URL: https://issues.apache.org/jira/browse/SPARK-34042
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.7
>Reporter: Marius Butan
>Priority: Major
>
> In PERMISSIVE mode:
> Given a CSV with multiple columns, if the schema contains only a single 
> column and the SQL query filters on the corrupt-record column being null, 
> the row is mapped as corrupted.
> BUT if you add an extra column to the CSV schema and do not select that 
> column in SQL, the row is not marked as corrupted.
>  
> PS. I don't know exactly what the right behavior is; I could not find 
> documentation for PERMISSIVE mode. What I did find is: As an example, a CSV 
> file contains the "id,name" header and one row "1234". In Spark 2.4, the 
> selection of the id column consists of a row with one column value 1234, but 
> in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore 
> the previous behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to 
> {{false}}.
>  
> [https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html]
>  
> I made a "unit" test in order to exemplify the issue: 
> [https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]
>  
>  
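
A hypothetical reproduction sketch of the setup described above; the file 
name, schema, and use of {{_corrupt_record}} are illustrative assumptions and 
are not taken from the linked test:

{code:scala}
// Hypothetical sketch: `people.csv` has the header "id,name" plus a malformed
// data row. Only `id` and the corrupt-record column are declared here; per the
// report, also declaring `name` (but never selecting it) changes which rows
// end up flagged as corrupted once column pruning kicks in.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val spark = SparkSession.builder().master("local[1]").appName("permissive-demo").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("id", IntegerType)
  .add("_corrupt_record", StringType)

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("people.csv")

// Caching sidesteps the restriction on querying a raw CSV source only by its
// corrupt-record column.
df.cache()
df.filter($"_corrupt_record".isNull).show()
{code}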



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34042) Column pruning is not working as expected for PERMISSIVE mode

2021-01-07 Thread Marius Butan (Jira)
Marius Butan created SPARK-34042:


 Summary: Column pruning is not working as expected for PERMISSIVE 
mode
 Key: SPARK-34042
 URL: https://issues.apache.org/jira/browse/SPARK-34042
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.7
Reporter: Marius Butan


In PERMISSIVE mode:

Given a CSV with multiple columns, if the schema contains only a single column 
and the SQL query filters on the corrupt-record column being null, the row is 
mapped as corrupted.

BUT if you add an extra column to the CSV schema and do not select that column 
in SQL, the row is not marked as corrupted.

 

PS. I don't know exactly what the right behavior is; I could not find 
documentation for PERMISSIVE mode. What I did find is: As an example, a CSV 
file contains the "id,name" header and one row "1234". In Spark 2.4, the 
selection of the id column consists of a row with one column value 1234, but 
in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the 
previous behavior, set {{spark.sql.csv.parser.columnPruning.enabled}} to 
{{false}}.

 

https://spark.apache.org/docs/2.4.0/sql-migration-guide-upgrade.html

 

I made a "unit" test in order to exemplify the issue: 
[https://github.com/butzy92/spark-column-mapping-issue/blob/master/src/test/java/spark/test/SparkTest.java]

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34041) Miscellaneous cleanup for new PySpark documentation

2021-01-07 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34041:
-
Description: 
1. Add a link of quick start in PySpark docs into "Programming Guides" in Spark 
main docs
2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
3. Mention MLlib user guide 
(https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



  was:
1. Add a link of quick start in PySpark docs into "Programming Guides" in Spark 
main docs
2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
3. Mention ML's migration doc 
(https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



> Miscellaneous cleanup for new PySpark documentation
> ---
>
> Key: SPARK-34041
> URL: https://issues.apache.org/jira/browse/SPARK-34041
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> 1. Add a link of quick start in PySpark docs into "Programming Guides" in 
> Spark main docs
> 2. ML MLlib -> "MLlib (DataFrame-based)" and "MLlib (RDD-based)"
> 3. Mention MLlib user guide 
> (https://dist.apache.org/repos/dist/dev/spark/v3.1.0-rc1-docs/_site/ml-guide.html)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260301#comment-17260301
 ] 

Apache Spark commented on SPARK-34039:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31081

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate table cache in {{ReplaceTable}} v2 command.
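
A rough sketch of the scenario the change targets, assuming a DSv2 catalog 
registered as {{testcat}} and a placeholder table provider {{foo}} (both names 
are made up for illustration):

{code:scala}
// Assumes an existing SparkSession `spark` with a DSv2 catalog `testcat`.
spark.sql("CREATE TABLE testcat.ns.t (id BIGINT) USING foo")
spark.sql("CACHE TABLE testcat.ns.t")

// REPLACE TABLE swaps the table definition; if the cache is not invalidated,
// later reads may keep serving the stale cached plan for testcat.ns.t.
spark.sql("REPLACE TABLE testcat.ns.t (id BIGINT, extra STRING) USING foo")
spark.table("testcat.ns.t").show()
{code}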



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34039:


Assignee: (was: Apache Spark)

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate table cache in {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17260299#comment-17260299
 ] 

Apache Spark commented on SPARK-34039:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31081

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate table cache in {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34039) [DSv2] ReplaceTable should invalidate cache

2021-01-07 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34039:


Assignee: Apache Spark

> [DSv2] ReplaceTable should invalidate cache
> ---
>
> Key: SPARK-34039
> URL: https://issues.apache.org/jira/browse/SPARK-34039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Similar to SPARK-33492, which handles the {{ReplaceTableAsSelect}} case, we 
> should also invalidate table cache in {{ReplaceTable}} v2 command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org