[jira] [Commented] (SPARK-37330) Migrate ReplaceTableStatement to v2 command

2021-11-14 Thread dch nguyen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443609#comment-17443609
 ] 

dch nguyen commented on SPARK-37330:


I am working on this

> Migrate ReplaceTableStatement to v2 command
> ---
>
> Key: SPARK-37330
> URL: https://issues.apache.org/jira/browse/SPARK-37330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37330) Migrate ReplaceTableStatement to v2 command

2021-11-14 Thread dch nguyen (Jira)
dch nguyen created SPARK-37330:
--

 Summary: Migrate ReplaceTableStatement to v2 command
 Key: SPARK-37330
 URL: https://issues.apache.org/jira/browse/SPARK-37330
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37329) File system delegation tokens are leaked

2021-11-14 Thread Wei-Chiu Chuang (Jira)
Wei-Chiu Chuang created SPARK-37329:
---

 Summary: File system delegation tokens are leaked
 Key: SPARK-37329
 URL: https://issues.apache.org/jira/browse/SPARK-37329
 Project: Spark
  Issue Type: Bug
  Components: Security, YARN
Affects Versions: 2.4.0
Reporter: Wei-Chiu Chuang


On a very busy Hadoop cluster (with HDFS at rest encryption) we found KMS 
accumulated millions of delegation tokens that are not cancelled even after 
jobs are finished, and KMS goes out of memory within a day because of the 
delegation token leak.

We were able to reproduce the bug in a smaller test cluster and realized that when 
a Spark job starts, it acquires two delegation tokens, and only one is cancelled 
properly after the job finishes. The other one is left over and lingers for up to 
7 days (the default Hadoop delegation token lifetime).

YARN handles the lifecycle of a delegation token properly if its renewer is 
'yarn'. However, Spark intentionally (a hack?) acquires a second delegation 
token with the job issuer as the renewer, simply to get the token renewal 
interval. The token is then ignored but not cancelled.

Proposal: cancel the delegation token immediately after the token renewal 
interval is obtained.

Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably 
existed since day 1.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan

2021-11-14 Thread Lietong Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lietong Liu updated SPARK-37328:

Description: 
Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to 
queryStagePreparationRules, the point where OptimizeSkewedJoin is applied has 
moved from newQueryStage() to reOptimize(). As a result, the plan it is applied 
to has changed from the plan of the new stage about to be submitted to the whole 
Spark plan.

In cases where the skewed join is not in the last stage, OptimizeSkewedJoin may 
not work because more than two shuffle stages are collected.

The following test demonstrates this:
{code:java}
test("OptimizeSkewJoin may not work") {
  withSQLConf(
    SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
    SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
    SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
    SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
    SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempView("skewData1", "skewData2", "skewData3") {
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 3 as key1", "id % 3 as value1")
        .createOrReplaceTempView("skewData1")
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 1 as key2", "id as value2")
        .createOrReplaceTempView("skewData2")
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 1 as key3", "id as value3")
        .createOrReplaceTempView("skewData3")

      // Query has two skewedJoin in two continuous stages.
      val (_, adaptive1) =
        runAdaptiveAndVerifyResult(
          """
            |SELECT key1 FROM skewData1 s1
            |JOIN skewData2 s2
            |ON s1.key1 = s2.key2
            |JOIN skewData3
            |ON s1.value1 = value3
            |""".stripMargin)
      val shuffles1 = collect(adaptive1) {
        case s: ShuffleExchangeExec => s
      }
      assert(shuffles1.size == 4)
      val smj1 = findTopLevelSortMergeJoin(adaptive1)
      assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
    }
  }
}
{code}
I'll open a PR shortly to fix this issue

 

  was:
Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to 
queryStagePreparationRules, the point where OptimizeSkewedJoin is applied has 
moved from newQueryStage() to reOptimize(). As a result, the plan it is applied 
to has changed from the plan of the new stage about to be submitted to the whole 
Spark plan.

In cases where the skewed join is not in the last stage, OptimizeSkewedJoin may 
not work because more than two shuffle stages are collected.

The following test demonstrates this:
{code:java}
test("OptimizeSkewJoin may not work") {
  withSQLConf(
    SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
    SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
    SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
    SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
    SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempView("skewData1", "skewData2", "skewData3") {
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 3 as key1", "id % 3 as value1")
        .createOrReplaceTempView("skewData1")
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 1 as key2", "id as value2")
        .createOrReplaceTempView("skewData2")
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 1 as key3", "id as value3")
        .createOrReplaceTempView("skewData3")

      // Query has two skewedJoin in two continuous stages.
      val (_, adaptive1) =
        runAdaptiveAndVerifyResult(
          """
            |SELECT key1 FROM skewData1 s1
            |JOIN skewData2 s2
            |ON s1.key1 = s2.key2
            |JOIN skewData3
            |ON s1.value1 = value3
            |""".stripMargin)
      val shuffles1 = collect(adaptive1) {
        case s: ShuffleExchangeExec => s
      }
      assert(shuffles1.size == 4)
      val smj1 = findTopLevelSortMergeJoin(adaptive1)
      assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
    }
  }
}
{code}
 

 


> SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was 
> applied on whole plan instead of new stage plan
> --
>
> Key: SPARK-37328
> URL: https://issues.apache.org/jira/browse/SPARK-37328
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Lietong Liu
>Priority: Major
>
> Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to 
> queryStagePreparationRules, the position OptimizeSkewedJoin was applied has 
> been moved from newQueryStage() to reOpt

[jira] [Created] (SPARK-37328) SPARK-33832 brings the bug that OptimizeSkewedJoin may not work since it was applied on whole plan instead of new stage plan

2021-11-14 Thread Lietong Liu (Jira)
Lietong Liu created SPARK-37328:
---

 Summary: SPARK-33832 brings the bug that OptimizeSkewedJoin may 
not work since it was applied on whole plan instead of new stage plan
 Key: SPARK-37328
 URL: https://issues.apache.org/jira/browse/SPARK-37328
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Lietong Liu


Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to 
queryStagePreparationRules, the point where OptimizeSkewedJoin is applied has 
moved from newQueryStage() to reOptimize(). As a result, the plan it is applied 
to has changed from the plan of the new stage about to be submitted to the whole 
Spark plan.

In cases where the skewed join is not in the last stage, OptimizeSkewedJoin may 
not work because more than two shuffle stages are collected.

The following test demonstrates this:
{code:java}
test("OptimizeSkewJoin may not work") {
  withSQLConf(
    SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1",
    SQLConf.SKEW_JOIN_SKEWED_PARTITION_THRESHOLD.key -> "100",
    SQLConf.ADVISORY_PARTITION_SIZE_IN_BYTES.key -> "100",
    SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
    SQLConf.SHUFFLE_PARTITIONS.key -> "10") {
    withTempView("skewData1", "skewData2", "skewData3") {
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 3 as key1", "id % 3 as value1")
        .createOrReplaceTempView("skewData1")
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 1 as key2", "id as value2")
        .createOrReplaceTempView("skewData2")
      spark
        .range(0, 1000, 1, 10)
        .selectExpr("id % 1 as key3", "id as value3")
        .createOrReplaceTempView("skewData3")

      // Query has two skewedJoin in two continuous stages.
      val (_, adaptive1) =
        runAdaptiveAndVerifyResult(
          """
            |SELECT key1 FROM skewData1 s1
            |JOIN skewData2 s2
            |ON s1.key1 = s2.key2
            |JOIN skewData3
            |ON s1.value1 = value3
            |""".stripMargin)
      val shuffles1 = collect(adaptive1) {
        case s: ShuffleExchangeExec => s
      }
      assert(shuffles1.size == 4)
      val smj1 = findTopLevelSortMergeJoin(adaptive1)
      assert(smj1.size == 2 && smj1.forall(_.isSkewJoin))
    }
  }
}
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37327:


Assignee: Apache Spark

> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for the pandas API on Spark's to_pandas() issues too many 
> messages, e.g. when the user runs plotting functions, so we want to silence the 
> warning when to_pandas() is called for internal purposes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443552#comment-17443552
 ] 

Apache Spark commented on SPARK-37327:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/34598

> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for the pandas API on Spark's to_pandas() issues too many 
> messages, e.g. when the user runs plotting functions, so we want to silence the 
> warning when to_pandas() is called for internal purposes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37327:


Assignee: (was: Apache Spark)

> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for the pandas API on Spark's to_pandas() issues too many 
> messages, e.g. when the user runs plotting functions, so we want to silence the 
> warning when to_pandas() is called for internal purposes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-14 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37327:

Description: 
Raised from comment 
[https://github.com/apache/spark/pull/34389#discussion_r741733023].

The advice warning for the pandas API on Spark's to_pandas() issues too many 
messages, e.g. when the user runs plotting functions, so we want to silence the 
warning when to_pandas() is called for internal purposes.
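For illustration only (an assumption, not the actual pandas-on-Spark mechanism): one generic way to keep such a warning away from end users on internal code paths is to filter it around the internal call, for example:

{code:python}
import warnings
from contextlib import contextmanager

# Hypothetical helper, not the real pandas-on-Spark implementation: temporarily
# ignore a given warning category while an internal code path runs.
@contextmanager
def silence_advice(category=UserWarning):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category)
        yield

# Usage sketch: an internal plotting helper converts to pandas without
# surfacing the conversion-advice warning to the end user.
def _internal_plot(psdf):
    with silence_advice():
        pdf = psdf.to_pandas()  # advice silenced for this internal call only
    return pdf.plot()
{code}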

  was:
Raised from comment 
[https://github.com/apache/spark/pull/34389#discussion_r741733023].

The pandas API on Spark shows an advice warning for expensive APIs 
([https://github.com/apache/spark/pull/34389#discussion_r741733023]), but it 
issues too many messages, so we want to silence the warning when the API is used 
for internal purposes.


> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The advice warning for the pandas API on Spark's to_pandas() issues too many 
> messages, e.g. when the user runs plotting functions, so we want to silence the 
> warning when to_pandas() is called for internal purposes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37327) Silence the to_pandas() advice log for internal usage

2021-11-14 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-37327:

Summary: Silence the to_pandas() advice log for internal usage  (was: 
Silence the advice log for internal usage)

> Silence the to_pandas() advice log for internal usage
> -
>
> Key: SPARK-37327
> URL: https://issues.apache.org/jira/browse/SPARK-37327
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Raised from comment 
> [https://github.com/apache/spark/pull/34389#discussion_r741733023].
> The pandas API on Spark shows an advice warning for expensive APIs 
> ([https://github.com/apache/spark/pull/34389#discussion_r741733023]), but it 
> issues too many messages, so we want to silence the warning when the API is used 
> for internal purposes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-37299) Fix Python linter failure in branch-3.1

2021-11-14 Thread Haejoon Lee (Jira)


[ https://issues.apache.org/jira/browse/SPARK-37299 ]


Haejoon Lee deleted comment on SPARK-37299:
-

was (Author: itholic):
Resolved at https://issues.apache.org/jira/browse/SPARK-37323

> Fix Python linter failure in branch-3.1
> ---
>
> Key: SPARK-37299
> URL: https://issues.apache.org/jira/browse/SPARK-37299
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The Python linter is failing for some reason in branch-3.1.
> [https://github.com/apache/spark/runs/4151205472?check_suite_focus=true]
> We should fix it



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37299) Fix Python linter failure in branch-3.1

2021-11-14 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443544#comment-17443544
 ] 

Haejoon Lee commented on SPARK-37299:
-

Resolved at https://issues.apache.org/jira/browse/SPARK-37323

> Fix Python linter failure in branch-3.1
> ---
>
> Key: SPARK-37299
> URL: https://issues.apache.org/jira/browse/SPARK-37299
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The Python linter is failing for some reason in branch-3.1.
> [https://github.com/apache/spark/runs/4151205472?check_suite_focus=true]
> We should fix it



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443538#comment-17443538
 ] 

Apache Spark commented on SPARK-36533:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/34597

> Allow streaming queries with Trigger.Once run in multiple batches
> -
>
> Key: SPARK-36533
> URL: https://issues.apache.org/jira/browse/SPARK-36533
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, streaming queries with Trigger.Once always load all of the 
> available data in a single batch. Because of this, the amount of data the 
> queries can process is limited, or the Spark driver may run out of memory. 
> We should allow streaming queries with Trigger.Once to run in multiple batches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443537#comment-17443537
 ] 

Apache Spark commented on SPARK-37326:
--

User 'sadikovi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34596

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37326:


Assignee: Apache Spark

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37326:


Assignee: (was: Apache Spark)

> Support TimestampNTZ in CSV data source
> ---
>
> Key: SPARK-37326
> URL: https://issues.apache.org/jira/browse/SPARK-37326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Ivan Sadikov
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37327) Silence the advice log for internal usage

2021-11-14 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-37327:
---

 Summary: Silence the advice log for internal usage
 Key: SPARK-37327
 URL: https://issues.apache.org/jira/browse/SPARK-37327
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Haejoon Lee


Raised from comment 
[https://github.com/apache/spark/pull/34389#discussion_r741733023].

The pandas API on Spark shows an advice warning for expensive APIs 
([https://github.com/apache/spark/pull/34389#discussion_r741733023]), but it 
issues too many messages, so we want to silence the warning when the API is used 
for internal purposes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37326) Support TimestampNTZ in CSV data source

2021-11-14 Thread Ivan Sadikov (Jira)
Ivan Sadikov created SPARK-37326:


 Summary: Support TimestampNTZ in CSV data source
 Key: SPARK-37326
 URL: https://issues.apache.org/jira/browse/SPARK-37326
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Ivan Sadikov






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-14 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
schema = StructType([
    StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # dataframe to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))

Hi, I have two DataFrames and I tried the method above, but I get:
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
I would be really thankful for any help.

PS: I think the method itself is fine; I created sample data and verified it 
successfully. However, the error appears when I use the real data. I checked 
the data but cannot figure out what causes it. Does anyone see where the problem is?
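For context (not part of the report): a SCALAR pandas_udf is expected to return a Series with one value per input row, so returning pd.Series(num) yields a length-1 result for a 100-row batch, which matches the error message. A minimal per-row sketch, with made-up names and sample data, might look like this:

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import IntegerType

# Toy stand-in for the driver-side pandas DataFrame from the report.
pd_fname = pd.DataFrame({"fname": ["alice", "bob", "alice smith"]})

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match_fixed(words: pd.Series) -> pd.Series:
    fnames = pd_fname.squeeze()  # single-column DataFrame -> Series
    # Return one count per input row so the result length matches the batch.
    return words.apply(lambda w: int(fnames.str.contains(w, regex=False).sum()))
{code}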

  was:
schema = StructType([
    StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # dataframe to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))

Hi, I have two DataFrames and I tried the method above, but I get:
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
I would be really thankful for any help.

PS: I think the method itself is fine; I created sample data and verified it 
successfully. However, the error appears when I use the real data. I checked 
the data but cannot figure out what causes it. Does anyone see where the problem is?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
> schema = StructType([
>     StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>     my_Series = pd_fname.squeeze()  # dataframe to Series
>     num = int(my_Series.str.contains(word.array[0]).sum())
>     return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> Hi, I have two DataFrames and I tried the method above, but I get:
> RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
> I would be really thankful for any help.
>  
> PS: I think the method itself is fine; I created sample data and verified it 
> successfully. However, the error appears when I use the real data. I checked 
> the data but cannot figure out what causes it. Does anyone see where the problem is?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-14 Thread liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liu updated SPARK-37325:

Description: 
schema = StructType([
    StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # dataframe to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))

Hi, I have two DataFrames and I tried the method above, but I get:
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
I would be really thankful for any help.

PS: I think the method itself is fine; I created sample data and verified it 
successfully. However, the error appears when I use the real data. I checked 
the data but cannot figure out what causes it. Does anyone see where the problem is?

  was:
schema = StructType([
    StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # dataframe to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))

Hi, I have two DataFrames and I tried the method above, but I get:
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
I would be really thankful for any help.

PS: I think the method itself is fine; I created sample data and verified it 
successfully. However, the error appears when I use the real data. I checked 
the data but cannot figure out what causes it. Does anyone see where the problem is?


> Result vector from pandas_udf was not the required length
> -
>
> Key: SPARK-37325
> URL: https://issues.apache.org/jira/browse/SPARK-37325
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: 1
>Reporter: liu
>Priority: Major
>
> schema = StructType([
>     StructField("node", StringType())
> ])
> rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
> rdd = rdd.map(lambda obj: {'node': obj})
> df_node = spark.createDataFrame(rdd, schema=schema)
> df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
> pd_fname = df_fname.select('fname').toPandas()
> @pandas_udf(IntegerType(), PandasUDFType.SCALAR)
> def udf_match(word: pd.Series) -> pd.Series:
>     my_Series = pd_fname.squeeze()  # dataframe to Series
>     num = int(my_Series.str.contains(word.array[0]).sum())
>     return pd.Series(num)
> df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))
> Hi, I have two DataFrames and I tried the method above, but I get:
> RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
> I would be really thankful for any help.
>  
> PS: I think the method itself is fine; I created sample data and verified it 
> successfully. However, the error appears when I use the real data. I checked 
> the data but cannot figure out what causes it. Does anyone see where the problem is?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37325) Result vector from pandas_udf was not the required length

2021-11-14 Thread liu (Jira)
liu created SPARK-37325:
---

 Summary: Result vector from pandas_udf was not the required length
 Key: SPARK-37325
 URL: https://issues.apache.org/jira/browse/SPARK-37325
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
 Environment: 1
Reporter: liu


schema = StructType([
    StructField("node", StringType())
])
rdd = sc.textFile("hdfs:///user/liubiao/KG/graph_dict.txt")
rdd = rdd.map(lambda obj: {'node': obj})
df_node = spark.createDataFrame(rdd, schema=schema)

df_fname = spark.read.parquet("hdfs:///user/liubiao/KG/fnames.parquet")
pd_fname = df_fname.select('fname').toPandas()

@pandas_udf(IntegerType(), PandasUDFType.SCALAR)
def udf_match(word: pd.Series) -> pd.Series:
    my_Series = pd_fname.squeeze()  # dataframe to Series
    num = int(my_Series.str.contains(word.array[0]).sum())
    return pd.Series(num)

df = df_node.withColumn("match_fname_num", udf_match(df_node["node"]))

Hi, I have two DataFrames and I tried the method above, but I get:
RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 1
I would be really thankful for any help.

PS: I think the method itself is fine; I created sample data and verified it 
successfully. However, the error appears when I use the real data. I checked 
the data but cannot figure out what causes it. Does anyone see where the problem is?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37309) Add available_now to PySpark trigger function

2021-11-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443517#comment-17443517
 ] 

Jungtaek Lim commented on SPARK-37309:
--

Ah wait... My follow-up PR included the method doc in PySpark, but missed 
updating the SS guide doc. I'll deal with this.
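For reference, a minimal sketch of how the trigger is expected to be used from PySpark once 3.3.0 is out; the availableNow keyword mirrors Trigger.AvailableNow, and the input/checkpoint paths are made up:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("available-now-example").getOrCreate()

query = (
    spark.readStream
        .format("text")
        .load("/tmp/streaming-input")  # hypothetical input directory
        .writeStream
        .format("console")
        .option("checkpointLocation", "/tmp/available-now-checkpoint")  # hypothetical path
        # Like Trigger.Once, but the backlog may be processed in multiple batches.
        .trigger(availableNow=True)
        .start()
)
query.awaitTermination()
{code}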

> Add available_now to PySpark trigger function
> -
>
> Key: SPARK-37309
> URL: https://issues.apache.org/jira/browse/SPARK-37309
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Lennox Stevenson
>Priority: Minor
>
> With the release of the new streaming query trigger called `AvailableNow` in 
> v3.2.0, it would be helpful to also include the trigger as part of the 
> pyspark API. Making the trigger a part of the pyspark API would make it clear 
> that this trigger is an option for python developers. This change to the 
> pyspark API should be accompanied by an update to the docs (similar to 
> https://github.com/apache/spark/pull/34333)
>  
> Related Jira ticket - https://issues.apache.org/jira/browse/SPARK-36533
> Related github PR - https://github.com/apache/spark/pull/33763



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37309) Add available_now to PySpark trigger function

2021-11-14 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-37309.
--
Resolution: Duplicate

I'm resolving this as a duplicate. Thanks for reporting!

> Add available_now to PySpark trigger function
> -
>
> Key: SPARK-37309
> URL: https://issues.apache.org/jira/browse/SPARK-37309
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Lennox Stevenson
>Priority: Minor
>
> With the release of the new streaming query trigger called `AvailableNow` in 
> v3.2.0, it would be helpful to also include the trigger as part of the 
> pyspark API. Making the trigger a part of the pyspark API would make it clear 
> that this trigger is an option for python developers. This change to the 
> pyspark API should be accompanied by an update to the docs (similar to 
> https://github.com/apache/spark/pull/34333)
>  
> Related Jira ticket - https://issues.apache.org/jira/browse/SPARK-36533
> Related github PR - https://github.com/apache/spark/pull/33763



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-37309) Add available_now to PySpark trigger function

2021-11-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443514#comment-17443514
 ] 

Jungtaek Lim edited comment on SPARK-37309 at 11/15/21, 3:03 AM:
-

This is resolved today via [https://github.com/apache/spark/pull/34592]

Worth noting that Trigger.AvailableNow is not yet in a released version of 
Apache Spark. It will be available in Spark 3.3.0, so we didn't miss the train 
yet.


was (Author: kabhwan):
This is resolved today via [https://github.com/apache/spark/pull/34592]

Worth noting that Trigger.AvailableNow is not yet in a released version of 
Apache Spark. This will be available in Spark 3.3.0, so we didn't miss the 
train yet.

> Add available_now to PySpark trigger function
> -
>
> Key: SPARK-37309
> URL: https://issues.apache.org/jira/browse/SPARK-37309
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Lennox Stevenson
>Priority: Minor
>
> With the release of the new streaming query trigger called `AvailableNow` in 
> v3.2.0, it would be helpful to also include the trigger as part of the 
> pyspark API. Making the trigger a part of the pyspark API would make it clear 
> that this trigger is an option for python developers. This change to the 
> pyspark API should be accompanied by an update to the docs (similar to 
> https://github.com/apache/spark/pull/34333)
>  
> Related Jira ticket - https://issues.apache.org/jira/browse/SPARK-36533
> Related github PR - https://github.com/apache/spark/pull/33763



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37309) Add available_now to PySpark trigger function

2021-11-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443514#comment-17443514
 ] 

Jungtaek Lim commented on SPARK-37309:
--

This is resolved today via [https://github.com/apache/spark/pull/34592]

Worth noting that Trigger.AvailableNow is not yet in a released version of 
Apache Spark. This will be available in Spark 3.3.0, so we didn't miss the 
train yet.

> Add available_now to PySpark trigger function
> -
>
> Key: SPARK-37309
> URL: https://issues.apache.org/jira/browse/SPARK-37309
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Lennox Stevenson
>Priority: Minor
>
> With the release of the new streaming query trigger called `AvailableNow` in 
> v3.2.0, it would be helpful to also include the trigger as part of the 
> pyspark API. Making the trigger a part of the pyspark API would make it clear 
> that this trigger is an option for python developers. This change to the 
> pyspark API should be accompanied by an update to the docs (similar to 
> https://github.com/apache/spark/pull/34333)
>  
> Related Jira ticket - https://issues.apache.org/jira/browse/SPARK-36533
> Related github PR - https://github.com/apache/spark/pull/33763



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37276) Support YearMonthIntervalType in Arrow

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37276.
--
Resolution: Later

There's no corresponding mapping to a Python instance. I am resolving this as 
Later for now.

> Support YearMonthIntervalType in Arrow
> --
>
> Key: SPARK-37276
> URL: https://issues.apache.org/jira/browse/SPARK-37276
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of YearMonthIntervalType in Arrow code path:
> - pandas UDFs
> - pandas functions APIs
> - createDataFrame/toPandas w/ Arrow



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37280) Support YearMonthIntervalType in Py4J

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37280.
--
Resolution: Later

There's no corresponding mapping to a Python instance. I am resolving this as 
Later for now.

> Support YearMonthIntervalType in Py4J
> -
>
> Key: SPARK-37280
> URL: https://issues.apache.org/jira/browse/SPARK-37280
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This PR adds the support of YearMonthIntervalType in Py4J. For example, 
> functions.lit(YearMonthIntervalType) should work.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37278.
--
Resolution: Later

There's no corresponding mapping to a Python instance. I am resolving this as 
Later for now.

> Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
> -
>
> Key: SPARK-37278
> URL: https://issues.apache.org/jira/browse/SPARK-37278
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Implements the support of YearMonthIntervalType in:
> - Python UDFs
> - createDataFrame/toPandas without Arrow



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37295) illegal reflective access operation has occurred; Please consider reporting this to the maintainers

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37295.
--
Resolution: Won't Fix

I'm resolving this - there isn't any easy way to avoid this other than upgrading 
to JDK 17.

> illegal reflective access operation has occurred; Please consider reporting 
> this to the maintainers
> ---
>
> Key: SPARK-37295
> URL: https://issues.apache.org/jira/browse/SPARK-37295
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
> Environment: MacBook pro running mac OS 11.6
> spark-3.1.2-bin-hadoop3.2
> it is not clear to me how spark finds java. I believe I also have java 8 
> installed somewhere
> ```
> $ which java
> ~/anaconda3/envs/extraCellularRNA/bin/java
> $ java -version
> openjdk version "11.0.6" 2020-01-14
> OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
> OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
> ```
>  
>Reporter: Andrew Davidson
>Priority: Major
>
> ```
>    spark = SparkSession\
>                 .builder\
>                 .appName("TestEstimatedScalingFactors")\
>                 .getOrCreate()
> ```
> generates the following warning
> ```
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/Users/xxx/googleUCSC/kimLab/extraCellularRNA/terra/deseq/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar)
>  to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/11/11 12:51:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> ```
> I am using pyspark spark-3.1.2-bin-hadoop3.2 on a MacBook pro running mac OS 
> 11.6
>  
> My small unit test seems to work okay, however it fails when I try to run on 
> 3.2.0
>  
> I
>  
> Any idea how I track down this issue? Kind regards
>  
> Andy
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37295) illegal reflective access operation has occurred; Please consider reporting this to the maintainers

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37295:
-
Priority: Major  (was: Critical)

> illegal reflective access operation has occurred; Please consider reporting 
> this to the maintainers
> ---
>
> Key: SPARK-37295
> URL: https://issues.apache.org/jira/browse/SPARK-37295
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
> Environment: MacBook pro running mac OS 11.6
> spark-3.1.2-bin-hadoop3.2
> it is not clear to me how spark finds java. I believe I also have java 8 
> installed somewhere
> ```
> $ which java
> ~/anaconda3/envs/extraCellularRNA/bin/java
> $ java -version
> openjdk version "11.0.6" 2020-01-14
> OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
> OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
> ```
>  
>Reporter: Andrew Davidson
>Priority: Major
>
> ```
>    spark = SparkSession\
>                 .builder\
>                 .appName("TestEstimatedScalingFactors")\
>                 .getOrCreate()
> ```
> generates the following warning
> ```
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/Users/xxx/googleUCSC/kimLab/extraCellularRNA/terra/deseq/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar)
>  to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/11/11 12:51:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> ```
> I am using pyspark spark-3.1.2-bin-hadoop3.2 on a MacBook pro running mac OS 
> 11.6
>  
> My small unit test seems to work okay, however it fails when I try to run on 
> 3.2.0
>  
> I
>  
> Any idea how I track down this issue? Kind regards
>  
> Andy
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37309) Add available_now to PySpark trigger function

2021-11-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443500#comment-17443500
 ] 

Hyukjin Kwon commented on SPARK-37309:
--

cc [~kabhwan] [~bozhang] FYI

> Add available_now to PySpark trigger function
> -
>
> Key: SPARK-37309
> URL: https://issues.apache.org/jira/browse/SPARK-37309
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Lennox Stevenson
>Priority: Minor
>
> With the release of the new streaming query trigger called `AvailableNow` in 
> v3.2.0, it would be helpful to also include the trigger as part of the 
> pyspark API. Making the trigger a part of the pyspark API would make it clear 
> that this trigger is an option for python developers. This change to the 
> pyspark API should be accompanied by an update to the docs (similar to 
> https://github.com/apache/spark/pull/34333)
>  
> Related Jira ticket - https://issues.apache.org/jira/browse/SPARK-36533
> Related github PR - https://github.com/apache/spark/pull/33763



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37299) Fix Python linter failure in branch-3.1

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37299.
--
Resolution: Duplicate

> Fix Python linter failure in branch-3.1
> ---
>
> Key: SPARK-37299
> URL: https://issues.apache.org/jira/browse/SPARK-37299
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Haejoon Lee
>Priority: Major
>
> The Python linter is failing for some reason in branch-3.1.
> [https://github.com/apache/spark/runs/4151205472?check_suite_focus=true]
> We should fix it



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37319) Support K8s image building with Java 17

2021-11-14 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-37319.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/34586

> Support K8s image building with Java 17
> ---
>
> Key: SPARK-37319
> URL: https://issues.apache.org/jira/browse/SPARK-37319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35345) Add BloomFilter Benchmark test for Parquet

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443498#comment-17443498
 ] 

Apache Spark commented on SPARK-35345:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34594

> Add BloomFilter Benchmark test for Parquet
> --
>
> Key: SPARK-35345
> URL: https://issues.apache.org/jira/browse/SPARK-35345
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Priority: Trivial
>
> Currently, we only have BloomFilter Benchmark test for ORC. Will add one for 
> Parquet too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35345) Add BloomFilter Benchmark test for Parquet

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443497#comment-17443497
 ] 

Apache Spark commented on SPARK-35345:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34594

> Add BloomFilter Benchmark test for Parquet
> --
>
> Key: SPARK-35345
> URL: https://issues.apache.org/jira/browse/SPARK-35345
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Huaxin Gao
>Priority: Trivial
>
> Currently, we only have BloomFilter Benchmark test for ORC. Will add one for 
> Parquet too.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37228) Implement DataFrame.mapInArrow in Python

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37228.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34505
[https://github.com/apache/spark/pull/34505]

> Implement DataFrame.mapInArrow in Python
> 
>
> Key: SPARK-37228
> URL: https://issues.apache.org/jira/browse/SPARK-37228
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37228) Implement DataFrame.mapInArrow in Python

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37228:


Assignee: Hyukjin Kwon

> Implement DataFrame.mapInArrow in Python
> 
>
> Key: SPARK-37228
> URL: https://issues.apache.org/jira/browse/SPARK-37228
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37323:


Assignee: Dongjoon Hyun

> Pin `docutils` in branch-3.1
> 
>
> Key: SPARK-37323
> URL: https://issues.apache.org/jira/browse/SPARK-37323
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> `docutils` 0.18 was released on October 26 and causes a Python linter failure in 
> branch-3.1.
> - https://pypi.org/project/docutils/#history
> - https://github.com/apache/spark/commits/branch-3.1
> {code}
> Exception occurred:
>   File 
> "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
>  line 445, in section_title_tags
> if (ids and self.settings.section_self_link
> AttributeError: 'Values' object has no attribute 'section_self_link'
> The full traceback has been saved in 
> /var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
> you want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [html] Error 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37323.
--
Fix Version/s: 3.1.3
   Resolution: Fixed

Issue resolved by pull request 34591
[https://github.com/apache/spark/pull/34591]

> Pin `docutils` in branch-3.1
> 
>
> Key: SPARK-37323
> URL: https://issues.apache.org/jira/browse/SPARK-37323
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.3
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.3
>
>
> `docutils` 0.18 was released on October 26 and causes a Python linter failure in
> branch-3.1.
> - https://pypi.org/project/docutils/#history
> - https://github.com/apache/spark/commits/branch-3.1
> {code}
> Exception occurred:
>   File 
> "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
>  line 445, in section_title_tags
> if (ids and self.settings.section_self_link
> AttributeError: 'Values' object has no attribute 'section_self_link'
> The full traceback has been saved in 
> /var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
> you want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [html] Error 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443488#comment-17443488
 ] 

Apache Spark commented on SPARK-37324:
--

User 'sathiyapk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34593

> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only the Decimal RoundingModes HALF_UP (round) and
> HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and
> RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing
> complex operations to achieve the same with Spark's native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Adding support for the other rounding modes could benefit many use cases.
> SQL Server does something similar:
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  
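
For reference, a minimal sketch of the UDF-style workaround mentioned above, using
java.math.BigDecimal's rounding modes (the function name `roundDown` is illustrative,
not an existing Spark API):

{code:scala}
// Spark's built-in round/bround cover HALF_UP and HALF_EVEN only, so other
// modes currently have to go through BigDecimal inside a UDF.
import java.math.{BigDecimal => JBigDecimal, RoundingMode}
import org.apache.spark.sql.functions.udf

val roundDown = udf { (value: Double, scale: Int) =>
  JBigDecimal.valueOf(value).setScale(scale, RoundingMode.DOWN).doubleValue
}

// Usage (illustrative): df.withColumn("amount_down", roundDown(col("amount"), lit(2)))
{code}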



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443489#comment-17443489
 ] 

Apache Spark commented on SPARK-36533:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/34592

> Allow streaming queries with Trigger.Once run in multiple batches
> -
>
> Key: SPARK-36533
> URL: https://issues.apache.org/jira/browse/SPARK-36533
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, streaming queries with Trigger.Once will always load all of the
> available data in a single batch. Because of this, the amount of data the
> queries can process is limited, or the Spark driver will run out of memory.
> We should allow streaming queries with Trigger.Once to run in multiple batches.
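
For context, a minimal sketch of a Trigger.Once query as it behaves today, where all
available input is planned into a single batch (paths, source format, and the schema
are illustrative):

{code:scala}
// A Trigger.Once query: today every run plans all available data as one batch,
// which is the behaviour this issue proposes to relax.
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val inputSchema = StructType(Seq(StructField("value", StringType)))  // illustrative schema

val query = spark.readStream
  .format("json")
  .schema(inputSchema)
  .load("/data/input")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoint")
  .trigger(Trigger.Once())
  .start("/data/output")

query.awaitTermination()
{code}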



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36533) Allow streaming queries with Trigger.Once run in multiple batches

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443487#comment-17443487
 ] 

Apache Spark commented on SPARK-36533:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/34592

> Allow streaming queries with Trigger.Once run in multiple batches
> -
>
> Key: SPARK-36533
> URL: https://issues.apache.org/jira/browse/SPARK-36533
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, streaming queries with Trigger.Once will always load all of the
> available data in a single batch. Because of this, the amount of data the
> queries can process is limited, or the Spark driver will run out of memory.
> We should allow streaming queries with Trigger.Once to run in multiple batches.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37324:


Assignee: (was: Apache Spark)

> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only the Decimal RoundingModes HALF_UP (round) and
> HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and
> RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing
> complex operations to achieve the same with Spark's native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Adding support for the other rounding modes could benefit many use cases.
> SQL Server does something similar:
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443486#comment-17443486
 ] 

Apache Spark commented on SPARK-37324:
--

User 'sathiyapk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34593

> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only the Decimal RoundingModes HALF_UP (round) and
> HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and
> RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing
> complex operations to achieve the same with Spark's native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Adding support for the other rounding modes could benefit many use cases.
> SQL Server does something similar:
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37324:


Assignee: Apache Spark

> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Assignee: Apache Spark
>Priority: Minor
>
> Currently we support only the Decimal RoundingModes HALF_UP (round) and
> HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and
> RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing
> complex operations to achieve the same with Spark's native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Adding support for the other rounding modes could benefit many use cases.
> SQL Server does something similar:
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37324:
--
Description: 
Currently we support only the Decimal RoundingModes HALF_UP (round) and HALF_EVEN
(bround), but we have use cases that need RoundingMode.UP and RoundingMode.DOWN.
In our projects we use a UDF; I also see a few people doing complex operations to
achieve the same with Spark's native methods.

[https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]

[https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]

[https://stackoverflow.com/questions/48279641/oracle-sql-round-half]

 

Adding support for the other rounding modes could benefit many use cases.
SQL Server does something similar:
[https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
 

  was:Currently we have support for 


> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we support only the Decimal RoundingModes HALF_UP (round) and
> HALF_EVEN (bround), but we have use cases that need RoundingMode.UP and
> RoundingMode.DOWN. In our projects we use a UDF; I also see a few people doing
> complex operations to achieve the same with Spark's native methods.
> [https://stackoverflow.com/questions/34888419/round-down-double-in-spark/40476117]
> [https://stackoverflow.com/questions/54683066/is-there-a-rounddown-function-in-sql-as-there-is-in-excel]
> [https://stackoverflow.com/questions/48279641/oracle-sql-round-half]
>  
> Adding support for the other rounding modes could benefit many use cases.
> SQL Server does something similar:
> [https://docs.microsoft.com/en-us/sql/t-sql/functions/round-transact-sql?view=sql-server-ver15]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37323:


Assignee: Apache Spark

> Pin `docutils` in branch-3.1
> 
>
> Key: SPARK-37323
> URL: https://issues.apache.org/jira/browse/SPARK-37323
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.3
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> `docutils` 0.18 was released on October 26 and causes a Python linter failure in
> branch-3.1.
> - https://pypi.org/project/docutils/#history
> - https://github.com/apache/spark/commits/branch-3.1
> {code}
> Exception occurred:
>   File 
> "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
>  line 445, in section_title_tags
> if (ids and self.settings.section_self_link
> AttributeError: 'Values' object has no attribute 'section_self_link'
> The full traceback has been saved in 
> /var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
> you want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [html] Error 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443460#comment-17443460
 ] 

Apache Spark commented on SPARK-37323:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34591

> Pin `docutils` in branch-3.1
> 
>
> Key: SPARK-37323
> URL: https://issues.apache.org/jira/browse/SPARK-37323
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `docutils` 0.18 was released on October 26 and causes a Python linter failure in
> branch-3.1.
> - https://pypi.org/project/docutils/#history
> - https://github.com/apache/spark/commits/branch-3.1
> {code}
> Exception occurred:
>   File 
> "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
>  line 445, in section_title_tags
> if (ids and self.settings.section_self_link
> AttributeError: 'Values' object has no attribute 'section_self_link'
> The full traceback has been saved in 
> /var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
> you want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [html] Error 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37323:


Assignee: (was: Apache Spark)

> Pin `docutils` in branch-3.1
> 
>
> Key: SPARK-37323
> URL: https://issues.apache.org/jira/browse/SPARK-37323
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `docutils` 0.18 was released on October 26 and causes a Python linter failure in
> branch-3.1.
> - https://pypi.org/project/docutils/#history
> - https://github.com/apache/spark/commits/branch-3.1
> {code}
> Exception occurred:
>   File 
> "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
>  line 445, in section_title_tags
> if (ids and self.settings.section_self_link
> AttributeError: 'Values' object has no attribute 'section_self_link'
> The full traceback has been saved in 
> /var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
> you want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [html] Error 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Sathiya Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sathiya Kumar updated SPARK-37324:
--
Description: Currently we have support for 

> Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
> 
>
> Key: SPARK-37324
> URL: https://issues.apache.org/jira/browse/SPARK-37324
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Sathiya Kumar
>Priority: Minor
>
> Currently we have support for 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37324) Support Decimal RoundingMode.UP, DOWN, HALF_DOWN

2021-11-14 Thread Sathiya Kumar (Jira)
Sathiya Kumar created SPARK-37324:
-

 Summary: Support Decimal RoundingMode.UP, DOWN, HALF_DOWN
 Key: SPARK-37324
 URL: https://issues.apache.org/jira/browse/SPARK-37324
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Sathiya Kumar






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37323:
--
Description: 
`docutils` 0.18 was released on October 26 and causes a Python linter failure in
branch-3.1.
- https://pypi.org/project/docutils/#history
- https://github.com/apache/spark/commits/branch-3.1

{code}
Exception occurred:
  File 
"/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
 line 445, in section_title_tags
if (ids and self.settings.section_self_link
AttributeError: 'Values' object has no attribute 'section_self_link'
The full traceback has been saved in 
/var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [html] Error 2
{code}

  was:
`docutils` 0.18 was released on October 26 and causes a Python linter failure in
branch-3.1.
- https://pypi.org/project/docutils/#history

{code}
Exception occurred:
  File 
"/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
 line 445, in section_title_tags
if (ids and self.settings.section_self_link
AttributeError: 'Values' object has no attribute 'section_self_link'
The full traceback has been saved in 
/var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [html] Error 2
{code}


> Pin `docutils` in branch-3.1
> 
>
> Key: SPARK-37323
> URL: https://issues.apache.org/jira/browse/SPARK-37323
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.3
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `docutils` 0.18 was released on October 26 and causes a Python linter failure in
> branch-3.1.
> - https://pypi.org/project/docutils/#history
> - https://github.com/apache/spark/commits/branch-3.1
> {code}
> Exception occurred:
>   File 
> "/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
>  line 445, in section_title_tags
> if (ids and self.settings.section_self_link
> AttributeError: 'Values' object has no attribute 'section_self_link'
> The full traceback has been saved in 
> /var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
> you want to report the issue to the developers.
> Please also report this if it was a user error, so that a better error 
> message can be provided next time.
> A bug report can be filed in the tracker at 
> . Thanks!
> make: *** [html] Error 2
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37323) Pin `docutils` in branch-3.1

2021-11-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-37323:
-

 Summary: Pin `docutils` in branch-3.1
 Key: SPARK-37323
 URL: https://issues.apache.org/jira/browse/SPARK-37323
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.1.3
Reporter: Dongjoon Hyun


`docutils` 0.18 was released on October 26 and causes a Python linter failure in
branch-3.1.
- https://pypi.org/project/docutils/#history

{code}
Exception occurred:
  File 
"/Users/dongjoon/.pyenv/versions/3.10.0/lib/python3.10/site-packages/docutils/writers/html5_polyglot/__init__.py",
 line 445, in section_title_tags
if (ids and self.settings.section_self_link
AttributeError: 'Values' object has no attribute 'section_self_link'
The full traceback has been saved in 
/var/folders/mq/c32xpgtj4tj19vt8b10wp8rcgn/T/sphinx-err-2h05ytx2.log, if 
you want to report the issue to the developers.
Please also report this if it was a user error, so that a better error message 
can be provided next time.
A bug report can be filed in the tracker at 
. Thanks!
make: *** [html] Error 2
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37289) Refactoring:remove the unnecessary function with partitionSchemaOption

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37289:
--
Affects Version/s: (was: 3.2.0)
   3.3.0

> Refactoring:remove the unnecessary function with partitionSchemaOption 
> ---
>
> Key: SPARK-37289
> URL: https://issues.apache.org/jira/browse/SPARK-37289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: tenglei
>Assignee: tenglei
>Priority: Minor
> Fix For: 3.3.0
>
>
> The partitionSchemaOption in HadoopFsRelation is unnecessary, as we can simply
> use partitionSchema.isEmpty or partitionSchema.nonEmpty.
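
A minimal sketch of the simplification being described (types trimmed down for
illustration; this is not the actual HadoopFsRelation source):

{code:scala}
// The Option-returning helper adds no information over the schema itself.
import org.apache.spark.sql.types.StructType

case class Relation(partitionSchema: StructType) {
  def partitionSchemaOption: Option[StructType] =
    if (partitionSchema.isEmpty) None else Some(partitionSchema)
}

val rel = Relation(new StructType())
// Callers can simply check emptiness directly:
val isPartitioned = rel.partitionSchema.nonEmpty  // instead of rel.partitionSchemaOption.isDefined
{code}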



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37289) Refactoring:remove the unnecessary function with partitionSchemaOption

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37289:
-

Assignee: tenglei

> Refactoring:remove the unnecessary function with partitionSchemaOption 
> ---
>
> Key: SPARK-37289
> URL: https://issues.apache.org/jira/browse/SPARK-37289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: tenglei
>Assignee: tenglei
>Priority: Minor
>
> The partitionSchemaOption in HadoopFsRelation is unnecessary, as we can simply
> use partitionSchema.isEmpty or partitionSchema.nonEmpty.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37289) Refactoring:remove the unnecessary function with partitionSchemaOption

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37289.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34582
[https://github.com/apache/spark/pull/34582]

> Refactoring:remove the unnecessary function with partitionSchemaOption 
> ---
>
> Key: SPARK-37289
> URL: https://issues.apache.org/jira/browse/SPARK-37289
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: tenglei
>Assignee: tenglei
>Priority: Minor
> Fix For: 3.3.0
>
>
> The partitionSchemaOption in HadoopFsRelation is unnecessary, as we can simply
> use partitionSchema.isEmpty or partitionSchema.nonEmpty.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37322:
-

Assignee: William Hyun

> `run_scala_tests` should respect test module order
> --
>
> Key: SPARK-37322
> URL: https://issues.apache.org/jira/browse/SPARK-37322
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.1.3, 3.2.1, 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37322.
---
Fix Version/s: 3.2.1
   3.1.3
   3.3.0
   Resolution: Fixed

Issue resolved by pull request 34590
[https://github.com/apache/spark/pull/34590]

> `run_scala_tests` should respect test module order
> --
>
> Key: SPARK-37322
> URL: https://issues.apache.org/jira/browse/SPARK-37322
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443425#comment-17443425
 ] 

Apache Spark commented on SPARK-37322:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34590

> `run_scala_tests` should respect test module order
> --
>
> Key: SPARK-37322
> URL: https://issues.apache.org/jira/browse/SPARK-37322
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37322:


Assignee: (was: Apache Spark)

> `run_scala_tests` should respect test module order
> --
>
> Key: SPARK-37322
> URL: https://issues.apache.org/jira/browse/SPARK-37322
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37322:


Assignee: Apache Spark

> `run_scala_tests` should respect test module order
> --
>
> Key: SPARK-37322
> URL: https://issues.apache.org/jira/browse/SPARK-37322
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443424#comment-17443424
 ] 

Apache Spark commented on SPARK-37322:
--

User 'williamhyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34590

> `run_scala_tests` should respect test module order
> --
>
> Key: SPARK-37322
> URL: https://issues.apache.org/jira/browse/SPARK-37322
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: William Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37322) `run_scala_tests` should respect test module order

2021-11-14 Thread William Hyun (Jira)
William Hyun created SPARK-37322:


 Summary: `run_scala_tests` should respect test module order
 Key: SPARK-37322
 URL: https://issues.apache.org/jira/browse/SPARK-37322
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0
Reporter: William Hyun






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Description: 
When CBO is enabled, a situation occurs where Spark tries to broadcast a very
large DataFrame due to wrong output size estimation.

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the
`getSizePerRow` function will estimate the size to be 20 bytes, while in our
case the actual size can be a lot larger.

As a result, in some cases we end up with an estimated size of < 300K while the
actual size can be > 8GB, thus leading to exceptions as Spark thinks the tables
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:

To run this I used `-Xmx45G`.
{code:scala}
import scala.util.Random
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")
(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))

val r = Random.alphanumeric
val l = 220
val i = 2800

val b = spark.read
   .parquet("/tmp/b")
   .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
 
a.join(b, col("index") === col("index_b")).show(2000)
{code}
 

  was:
When CBO is enabled, a situation occurs where Spark tries to broadcast a very
large DataFrame due to wrong output size estimation.

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the
`getSizePerRow` function will estimate the size to be 20 bytes, while in our
case the actual size can be a lot larger.

As a result, in some cases we end up with an estimated size of < 300K while the
actual size can be > 8GB, thus leading to exceptions as Spark thinks the tables
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import scala.util.Random
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")
(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l

[jira] [Updated] (SPARK-37317) Reduce weights in GaussianMixtureSuite

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-37317:
--
Fix Version/s: 3.2.1

> Reduce weights in GaussianMixtureSuite
> --
>
> Key: SPARK-37317
> URL: https://issues.apache.org/jira/browse/SPARK-37317
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.1, 3.3.0
>
>
> {code}
> $ build/sbt "mllib/test"
> ...
> [info] *** 1 TEST FAILED ***
> [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7
> [error] Failed tests:
> [error]   org.apache.spark.ml.clustering.GaussianMixtureSuite
> [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37320.
---
Fix Version/s: 3.3.0
   3.2.1
   3.1.3
   Resolution: Fixed

Issue resolved by pull request 34588
[https://github.com/apache/spark/pull/34588]

> Delete py_container_checks.zip after the test in DepsTestsSuite finishes
> 
>
> Key: SPARK-37320
> URL: https://issues.apache.org/jira/browse/SPARK-37320
> Project: Spark
>  Issue Type: Bug
>  Components: k8, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
>
> When K8s integration tests run, py_container_checks.zip still remains in
> resource-managers/kubernetes/integration-tests/tests/.
> It is created in the test "Launcher python client dependencies using a zip
> file" in DepsTestsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37317) Reduce weights in GaussianMixtureSuite

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37317.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34584
[https://github.com/apache/spark/pull/34584]

> Reduce weights in GaussianMixtureSuite
> --
>
> Key: SPARK-37317
> URL: https://issues.apache.org/jira/browse/SPARK-37317
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Tests
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>
> {code}
> $ build/sbt "mllib/test"
> ...
> [info] *** 1 TEST FAILED ***
> [error] Failed: Total 1760, Failed 1, Errors 0, Passed 1759, Ignored 7
> [error] Failed tests:
> [error]   org.apache.spark.ml.clustering.GaussianMixtureSuite
> [error] (mllib / Test / test) sbt.TestsFailedException: Tests unsuccessful
> [error] Total time: 625 s (10:25), completed Nov 13, 2021, 6:21:13 PM
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32567) Code-gen for full outer shuffled hash join

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443369#comment-17443369
 ] 

Apache Spark commented on SPARK-32567:
--

User 'jerqi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34589

> Code-gen for full outer shuffled hash join
> --
>
> Key: SPARK-32567
> URL: https://issues.apache.org/jira/browse/SPARK-32567
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> As a follow-up to [https://github.com/apache/spark/pull/29342] (non-codegen
> full outer shuffled hash join), this task is to add code-gen for it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37321) Wrong size estimation that leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Summary: Wrong size estimation that leads to "Cannot broadcast the table 
that is larger than 8GB: 8 GB"  (was: Wrong size estimation that leads to 
Cannot broadcast the table that is larger than 8GB: 8 GB)

> Wrong size estimation that leads to "Cannot broadcast the table that is 
> larger than 8GB: 8 GB"
> --
>
> Key: SPARK-37321
> URL: https://issues.apache.org/jira/browse/SPARK-37321
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Izek Greenfield
>Priority: Major
>
> When CBO is enabled, a situation occurs where Spark tries to broadcast
> a very large DataFrame due to wrong output size estimation.
>  
> In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will
> use `DataType.defaultSize`.
> In the case where the output contains `functions.concat_ws`, the
> `getSizePerRow` function will estimate the size to be 20 bytes, while in our
> case the actual size can be a lot larger.
> As a result, in some cases we end up with an estimated size of < 300K while
> the actual size can be > 8GB, thus leading to exceptions as Spark thinks the
> tables may be broadcast but later realizes the data size is too large.
>  
> Code sample to reproduce:
> {code:scala}
> import scala.util.Random
> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions.{col, lit}
> import spark.implicits._
> (1 to 10).toDF("index").withColumn("index", 
> col("index").cast("string")).write.parquet("/tmp/a")
> (1 to 1000).toDF("index_b").withColumn("index_b", 
> col("index_b").cast("string")).write.parquet("/tmp/b")
> val a = spark.read
>    .parquet("/tmp/a")
>    .withColumn("b", col("index"))
>    .withColumn("l1", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l2", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l3", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l4", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l5", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
> val r = Random.alphanumeric
> val l = 220
> val i = 2800
> val b = spark.read
>    .parquet("/tmp/b")
>    .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>  
> a.join(b, col("index") === col("index_b")).show(2000)
> {code}
>  
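
For reference, a minimal sketch of where the 20-byte figure above comes from,
assuming the fallback path with no column statistics:

{code:scala}
// With no column statistics, the per-row estimate falls back to
// DataType.defaultSize; for StringType that is a fixed 20 bytes,
// regardless of the actual string contents.
import org.apache.spark.sql.types.StringType

println(StringType.defaultSize)  // 20
{code}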



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Component/s: SQL

> Wrong size estimation leads to "Cannot broadcast the table that is larger 
> than 8GB: 8 GB"
> -
>
> Key: SPARK-37321
> URL: https://issues.apache.org/jira/browse/SPARK-37321
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Izek Greenfield
>Priority: Major
>
> When CBO is enabled, a situation occurs where Spark tries to broadcast
> a very large DataFrame due to wrong output size estimation.
>  
> In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will
> use `DataType.defaultSize`.
> In the case where the output contains `functions.concat_ws`, the
> `getSizePerRow` function will estimate the size to be 20 bytes, while in our
> case the actual size can be a lot larger.
> As a result, in some cases we end up with an estimated size of < 300K while
> the actual size can be > 8GB, thus leading to exceptions as Spark thinks the
> tables may be broadcast but later realizes the data size is too large.
>  
> Code sample to reproduce:
> {code:scala}
> import scala.util.Random
> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions.{col, lit}
> import spark.implicits._
> (1 to 10).toDF("index").withColumn("index", 
> col("index").cast("string")).write.parquet("/tmp/a")
> (1 to 1000).toDF("index_b").withColumn("index_b", 
> col("index_b").cast("string")).write.parquet("/tmp/b")
> val a = spark.read
>    .parquet("/tmp/a")
>    .withColumn("b", col("index"))
>    .withColumn("l1", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l2", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l3", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l4", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l5", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
> val r = Random.alphanumeric
> val l = 220
> val i = 2800
> val b = spark.read
>    .parquet("/tmp/b")
>    .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>  
> a.join(b, col("index") === col("index_b")).show(2000)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37321) Wrong size estimation leads to "Cannot broadcast the table that is larger than 8GB: 8 GB"

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Summary: Wrong size estimation leads to "Cannot broadcast the table that is 
larger than 8GB: 8 GB"  (was: Wrong size estimation that leads to "Cannot 
broadcast the table that is larger than 8GB: 8 GB")

> Wrong size estimation leads to "Cannot broadcast the table that is larger 
> than 8GB: 8 GB"
> -
>
> Key: SPARK-37321
> URL: https://issues.apache.org/jira/browse/SPARK-37321
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Izek Greenfield
>Priority: Major
>
> When CBO is enabled, a situation occurs where Spark tries to broadcast
> a very large DataFrame due to wrong output size estimation.
>  
> In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will
> use `DataType.defaultSize`.
> In the case where the output contains `functions.concat_ws`, the
> `getSizePerRow` function will estimate the size to be 20 bytes, while in our
> case the actual size can be a lot larger.
> As a result, in some cases we end up with an estimated size of < 300K while
> the actual size can be > 8GB, thus leading to exceptions as Spark thinks the
> tables may be broadcast but later realizes the data size is too large.
>  
> Code sample to reproduce:
> {code:scala}
> import scala.util.Random
> import org.apache.spark.sql.functions
> import org.apache.spark.sql.functions.{col, lit}
> import spark.implicits._
> (1 to 10).toDF("index").withColumn("index", 
> col("index").cast("string")).write.parquet("/tmp/a")
> (1 to 1000).toDF("index_b").withColumn("index_b", 
> col("index_b").cast("string")).write.parquet("/tmp/b")
> val a = spark.read
>    .parquet("/tmp/a")
>    .withColumn("b", col("index"))
>    .withColumn("l1", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l2", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l3", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l4", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
>    .withColumn("l5", functions.concat_ws("/", col("index"), 
> functions.current_date(), functions.current_date(), functions.current_date(), 
> functions.current_date()))
> val r = Random.alphanumeric
> val l = 220
> val i = 2800
> val b = spark.read
>    .parquet("/tmp/b")
>    .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>    .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
> List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
>  
> a.join(b, col("index") === col("index_b")).show(2000)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37321) Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB

2021-11-14 Thread Izek Greenfield (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-37321:

Description: 
When CBO is enabled, a situation occurs where Spark tries to broadcast a very
large DataFrame due to wrong output size estimation.

In `EstimationUtils.getSizePerRow`, if there are no statistics then Spark will
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the
`getSizePerRow` function will estimate the size to be 20 bytes, while in our
case the actual size can be a lot larger.

As a result, in some cases we end up with an estimated size of < 300K while the
actual size can be > 8GB, thus leading to exceptions as Spark thinks the tables
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import scala.util.Random
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")
(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))

val r = Random.alphanumeric
val l = 220
val i = 2800

val b = spark.read
   .parquet("/tmp/b")
   .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
 
a.join(b, col("index") === col("index_b")).show(2000)
{code}
 

  was:
When CBO is enabled then a situation occurs where spark tries to broadcast very 
large DataFrame due to wrong output size estimation.

 

In `EstimationUtils.getSizePerRow`, if there is no statistics then spark will 
use `DataType.defaultSize`.

In the case where the output contains `functions.concat_ws`, the 
`getSizePerRow` function will estimate the size to be 20 bytes, while in our 
case the actual size can be a lot larger.

As a result, we in some cases end up with an estimated size of < 300K while the 
actual size can be > 8GB, thus leading to exceptions as spark thinks the tables 
may be broadcast but later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")

(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("

[jira] [Created] (SPARK-37321) Wrong size estimation that leads to Cannot broadcast the table that is larger than 8GB: 8 GB

2021-11-14 Thread Izek Greenfield (Jira)
Izek Greenfield created SPARK-37321:
---

 Summary: Wrong size estimation that leads to Cannot broadcast the 
table that is larger than 8GB: 8 GB
 Key: SPARK-37321
 URL: https://issues.apache.org/jira/browse/SPARK-37321
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 3.2.0, 3.1.1
Reporter: Izek Greenfield


When CBO is enabled, a situation occurs where Spark tries to broadcast a very 
large DataFrame because of a wrong output size estimation.

In `EstimationUtils.getSizePerRow`, if there are no statistics, Spark falls 
back to `DataType.defaultSize`.

When the output contains `functions.concat_ws`, `getSizePerRow` estimates the 
size to be 20 bytes per value, while in our case the actual size can be a lot 
larger.

As a result, we sometimes end up with an estimated size of < 300K while the 
actual size can be > 8GB, which leads to exceptions: Spark decides the tables 
may be broadcast and only later realizes the data size is too large.

 

Code sample to reproduce:
{code:scala}
import spark.implicits._

(1 to 10).toDF("index").withColumn("index", 
col("index").cast("string")).write.parquet("/tmp/a")

(1 to 1000).toDF("index_b").withColumn("index_b", 
col("index_b").cast("string")).write.parquet("/tmp/b")

val a = spark.read
   .parquet("/tmp/a")
   .withColumn("b", col("index"))
   .withColumn("l1", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l2", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l3", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l4", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))
   .withColumn("l5", functions.concat_ws("/", col("index"), 
functions.current_date(), functions.current_date(), functions.current_date(), 
functions.current_date()))

val r = Random.alphanumeric
val l = 220
val i = 2800

val b = spark.read
   .parquet("/tmp/b")
   .withColumn("l1", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l2", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l3", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l4", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l5", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l6", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
   .withColumn("l7", functions.concat_ws("/", (0 to i).flatMap(a => 
List(col("index_b"), lit(r.take(l).mkString), lit(r.take(l).mkString))): _*))
 
a.join(b, col("index") === col("index_b")).show(2000)
{code}
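Until the estimator accounts for wide `concat_ws` output, one possible mitigation (a sketch only, using standard Spark SQL settings and the `a`/`b` DataFrames from the sample above; it does not fix the estimator itself) is to stop the planner from auto-broadcasting so the join falls back to a shuffle-based join:

{code:scala}
// Mitigation sketch, not a fix for EstimationUtils itself.
spark.conf.set("spark.sql.cbo.enabled", "false")              // ignore CBO size estimates
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // never auto-broadcast
val joined = a.join(b, col("index") === col("index_b"))       // planned as a shuffle join
joined.show(2000)
{code}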
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36934) Timestamp are written as array bytes.

2021-11-14 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-36934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443343#comment-17443343
 ] 

Bjørn Jørgensen commented on SPARK-36934:
-

This is now fixed in Apache Drill 

https://issues.apache.org/jira/browse/DRILL-8007?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17443239#comment-17443239
 

> Timestamp are written as array bytes.
> -
>
> Key: SPARK-36934
> URL: https://issues.apache.org/jira/browse/SPARK-36934
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> This is tested with master build 04.10.21
> {code}
> df = ps.DataFrame({'year': ['2015-2-4', '2016-3-5'],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                   'test': [1, 2]})  
> df["year"] = ps.to_datetime(df["year"]) 
> df.info() 
> Int64Index: 2 entries, 0 to 1
> Data columns (total 4 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      datetime64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   test    2 non-null      int64
> dtypes: datetime64(1), int64(3)
> spark_df_date = df.to_spark() 
> spark_df_date.printSchema() 
> root
> |-- year: timestamp (nullable = true)
> |-- month: long (nullable = false)
> |-- day: long (nullable = false)
> |-- test: long (nullable = false)  
> spark_df_date.write.parquet("s3a://falk0509/spark_df_date.parquet")  
> {code}
> Load the files in to Apache drill I use docker apache/drill:master-openjdk-14 
>  
> SELECT * FROM cp.`/data/spark_df_date.*`  
> It print's
> year
> {code}
> \x00\x00\x00\x00\x00\x00\x00\x00\xE2}%\x00
> \x00\x00\x00\x00\x00\x00\x00\x00m\x7F%\x00 
> {code}
>  
> The rest of the columns are ok.   
> So is this a Spark problem or an Apache Drill problem? 
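A possible Spark-side reading (a hedged sketch, separate from the Drill fix referenced above): the 12-byte blobs are what legacy INT96 parquet timestamps look like to a reader that does not decode them, and Spark still writes INT96 by default. Writing a standard parquet timestamp type is one way to keep the column portable; the output path below is an illustrative placeholder:

{code:scala}
// Sketch only: emit TIMESTAMP_MICROS instead of Spark's default INT96 so
// other engines decode the column as a timestamp rather than raw bytes.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
spark.range(2)
  .selectExpr("TIMESTAMP '2015-02-04 00:00:00' AS year", "id AS test")
  .write.mode("overwrite").parquet("/tmp/spark_df_date_micros")
{code}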



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37320:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Delete py_container_checks.zip after the test in DepsTestsSuite finishes
> 
>
> Key: SPARK-37320
> URL: https://issues.apache.org/jira/browse/SPARK-37320
> Project: Spark
>  Issue Type: Bug
>  Components: k8, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When K8s integration tests run, py_container_checks.zip  still remains in 
> resource-managers/kubernetes/integration-tests/tests/.
> It is created in the test "Launcher python client dependencies using a zip 
> file" in DepsTestsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443269#comment-17443269
 ] 

Apache Spark commented on SPARK-37320:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34588

> Delete py_container_checks.zip after the test in DepsTestsSuite finishes
> 
>
> Key: SPARK-37320
> URL: https://issues.apache.org/jira/browse/SPARK-37320
> Project: Spark
>  Issue Type: Bug
>  Components: k8, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When K8s integration tests run, py_container_checks.zip  still remains in 
> resource-managers/kubernetes/integration-tests/tests/.
> It is created in the test "Launcher python client dependencies using a zip 
> file" in DepsTestsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37320:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Delete py_container_checks.zip after the test in DepsTestsSuite finishes
> 
>
> Key: SPARK-37320
> URL: https://issues.apache.org/jira/browse/SPARK-37320
> Project: Spark
>  Issue Type: Bug
>  Components: k8, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> When K8s integration tests run, py_container_checks.zip  still remains in 
> resource-managers/kubernetes/integration-tests/tests/.
> It is created in the test "Launcher python client dependencies using a zip 
> file" in DepsTestsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-37318:
-

Assignee: Dongjoon Hyun

> Make FallbackStorageSuite robust in terms of DNS
> 
>
> Key: SPARK-37318
> URL: https://issues.apache.org/jira/browse/SPARK-37318
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Usually, the test case expects the hostname doesn't exist.
> {code}
> $ ping remote
> ping: cannot resolve remote: Unknown host
> {code}
> In some DNS environments, however, the hostname always resolves.
> {code}
> $ ping remote
> PING remote (23.217.138.110): 56 data bytes
> 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms
> {code}
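One way such a test can shield itself from wildcard DNS setups (a sketch with assumed wiring, not necessarily the actual change in the fix) is to cancel itself when the supposedly unknown host does resolve:

{code:scala}
import java.net.InetAddress
import scala.util.Try

import org.scalatest.Assertions.assume

// Inside the test body: if "remote" unexpectedly resolves (e.g. behind a DNS
// service that answers for any name), cancel the test instead of failing it.
val unresolvable = Try(InetAddress.getByName("remote")).isFailure
assume(unresolvable, "'remote' resolves in this DNS environment; skipping")
{code}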



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37318) Make FallbackStorageSuite robust in terms of DNS

2021-11-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-37318.
---
Fix Version/s: 3.3.0
   3.2.1
   Resolution: Fixed

Issue resolved by pull request 34585
[https://github.com/apache/spark/pull/34585]

> Make FallbackStorageSuite robust in terms of DNS
> 
>
> Key: SPARK-37318
> URL: https://issues.apache.org/jira/browse/SPARK-37318
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
>
> Usually, the test case expects the hostname doesn't exist.
> {code}
> $ ping remote
> ping: cannot resolve remote: Unknown host
> {code}
> In some DNS environments, however, the hostname always resolves.
> {code}
> $ ping remote
> PING remote (23.217.138.110): 56 data bytes
> 64 bytes from 23.217.138.110: icmp_seq=0 ttl=57 time=8.660 ms
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-37320:
---
Description: 
When K8s integration tests run, py_container_checks.zip  still remains in 
resource-managers/kubernetes/integration-tests/tests/.
It is created in the test "Launcher python client dependencies using a zip 
file" in DepsTestsSuite.

  was:
When K8s integration tests run, py_container_checks.zip is still remaining in 
resource-managers/kubernetes/integration-tests/tests/.
It's is created in the test "Launcher python client dependencies using a zip 
file" in DepsTestsSuite.


> Delete py_container_checks.zip after the test in DepsTestsSuite finishes
> 
>
> Key: SPARK-37320
> URL: https://issues.apache.org/jira/browse/SPARK-37320
> Project: Spark
>  Issue Type: Bug
>  Components: k8, Tests
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> When K8s integration tests run, py_container_checks.zip  still remains in 
> resource-managers/kubernetes/integration-tests/tests/.
> It is created in the test "Launcher python client dependencies using a zip 
> file" in DepsTestsSuite.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37320) Delete py_container_checks.zip after the test in DepsTestsSuite finishes

2021-11-14 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-37320:
--

 Summary: Delete py_container_checks.zip after the test in 
DepsTestsSuite finishes
 Key: SPARK-37320
 URL: https://issues.apache.org/jira/browse/SPARK-37320
 Project: Spark
  Issue Type: Bug
  Components: k8, Tests
Affects Versions: 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


When K8s integration tests run, py_container_checks.zip still remains in 
resource-managers/kubernetes/integration-tests/tests/.
It is created in the test "Launcher python client dependencies using a zip 
file" in DepsTestsSuite.
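A minimal cleanup sketch (the path comes from the description; the surrounding test plumbing is assumed, not the actual patch):

{code:scala}
import java.nio.file.{Files, Paths}

// Delete the archive generated by the "Launcher python client dependencies
// using a zip file" test so it does not linger in the source tree.
val zip = Paths.get(
  "resource-managers/kubernetes/integration-tests/tests/py_container_checks.zip")
try {
  // ... run the dependency test that creates the zip ...
} finally {
  Files.deleteIfExists(zip)
}
{code}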



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org