[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AK97 updated SPARK-30400:
- Shepherd: Yin Huai

> Test failure in SQL module on ppc64le
> -------------------------------------
> Key: SPARK-30400
> URL: https://issues.apache.org/jira/browse/SPARK-30400
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.4.0
> Environment: os: rhel 7.6
> arch: ppc64le
> Reporter: AK97
> Priority: Major
>
> I have been trying to build Apache Spark on rhel_7.6/ppc64le; however, the test cases in the SQL module fail with the following errors:
> {code}
> - CREATE TABLE USING AS SELECT based on the file without write permission *** FAILED ***
>   Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown (CreateTableAsSelectSuite.scala:92)
> - create a table, drop it and create another one with the same name *** FAILED ***
>   org.apache.spark.sql.AnalysisException: Table default.jsonTable already exists. You need to drop it first.;
>   at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:159)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:115)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195)
>   at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> {code}
> I would like some help understanding the cause of these failures. I am running the build on a high-end VM with good connectivity.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AK97 updated SPARK-30400:
- Shepherd: (was: Yin Huai)
[jira] [Updated] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AK97 updated SPARK-30400:
- Shepherd: Yin Huai
- Environment: os: rhel 7.6 arch: ppc64le (was: os: rhel 7.6 arch: ppc64le)
[jira] [Commented] (SPARK-30400) Test failure in SQL module on ppc64le
[ https://issues.apache.org/jira/browse/SPARK-30400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012542#comment-17012542 ] AK97 commented on SPARK-30400:
Any leads would be appreciated.
[jira] [Created] (SPARK-30481) Integrate event log compactor into Spark History Server
Jungtaek Lim created SPARK-30481:
Summary: Integrate event log compactor into Spark History Server
Key: SPARK-30481
URL: https://issues.apache.org/jira/browse/SPARK-30481
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim

This issue tracks the effort to compact old event log files (and clean up after compaction) without breaking the compatibility guarantee. It depends on SPARK-29779 and SPARK-30479, and focuses on integrating the event log compactor into the Spark History Server and enabling its configuration.
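As a generic illustration of the selection step such a compactor needs, the sketch below splits rolling event-log files into a batch to compact and a tail to retain. It is a hypothetical helper, not Spark's code; the name `select_files_to_compact` and the `max_files_to_retain` knob are assumptions modeled on the rolling event-log design.

```python
# Hypothetical sketch of one piece of an event-log compactor: given rolling
# event-log files tagged with increasing indices, keep the newest
# `max_files_to_retain` files as-is and hand the older ones to compaction.

def select_files_to_compact(indexed_files, max_files_to_retain=2):
    """indexed_files: list of (index, filename). Returns (to_compact, to_retain)."""
    ordered = [name for _, name in sorted(indexed_files)]
    if len(ordered) <= max_files_to_retain:
        return [], ordered
    cut = len(ordered) - max_files_to_retain
    return ordered[:cut], ordered[cut:]

to_compact, to_retain = select_files_to_compact(
    [(1, "events_1"), (2, "events_2"), (3, "events_3"), (4, "events_4")])
```

The compactor would then replay the `to_compact` files, apply its filters, and write one compacted file, leaving the retained tail untouched.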
[jira] [Resolved] (SPARK-30480) Pyspark test "test_memory_limit" fails consistently
[ https://issues.apache.org/jira/browse/SPARK-30480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30480.
Fix Version/s: 3.0.0
Resolution: Fixed
Fixed in [https://github.com/apache/spark/pull/27162]

> Pyspark test "test_memory_limit" fails consistently
> Key: SPARK-30480
> URL: https://issues.apache.org/jira/browse/SPARK-30480
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.0
> Reporter: Jungtaek Lim
> Priority: Major
> Fix For: 3.0.0
>
> I'm seeing consistent pyspark test failures on multiple PRs ([#26955|https://github.com/apache/spark/pull/26955], [#26201|https://github.com/apache/spark/pull/26201], [#27064|https://github.com/apache/spark/pull/27064]), and they all failed from "test_memory_limit".
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116422/testReport]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116438/testReport]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116429/testReport]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116366/testReport]
[jira] [Closed] (SPARK-29776) rpad and lpad should return NULL when padstring parameter is empty
[ https://issues.apache.org/jira/browse/SPARK-29776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-29776.

> rpad and lpad should return NULL when padstring parameter is empty
> Key: SPARK-29776
> URL: https://issues.apache.org/jira/browse/SPARK-29776
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: ABHISHEK KUMAR GUPTA
> Priority: Major
>
> As per the rpad definition:
> rpad(str, len, pad) - Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. *If the pad string is empty, the return value is NULL.*
> Below is an example. In Spark:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT rpad('hi', 5, '');
> +----------------+
> | rpad(hi, 5, )  |
> +----------------+
> | hi             |
> +----------------+
> {code}
> It should return NULL as per the definition.
> Hive's behavior is correct: as per the definition, it returns NULL when pad is an empty string.
> INFO : Concurrency mode is disabled, not creating a lock manager
> {code}
> +-------+
> | _c0   |
> +-------+
> | NULL  |
> +-------+
> {code}
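For reference, the NULL-on-empty-pad semantics the report asks for can be sketched in plain Python. This is only an illustration of the expected behavior, not Spark's or Hive's implementation.

```python
# Sketch of rpad with the semantics described in the issue: right-pad `s`
# with `pad` up to `length`; an empty pad string yields None (SQL NULL).

def rpad(s, length, pad):
    if s is None or pad is None:
        return None
    if len(s) >= length:
        return s[:length]          # longer input is truncated to `length`
    if pad == "":
        return None                # the behavior the issue says is correct
    repeats = (length - len(s) + len(pad) - 1) // len(pad)
    return (s + pad * repeats)[:length]
```

With these semantics, `rpad('hi', 5, '')` yields NULL rather than the unpadded input.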
[jira] [Created] (SPARK-30480) Pyspark test "test_memory_limit" fails consistently
Jungtaek Lim created SPARK-30480:
Summary: Pyspark test "test_memory_limit" fails consistently
Key: SPARK-30480
URL: https://issues.apache.org/jira/browse/SPARK-30480
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.0.0
Reporter: Jungtaek Lim

I'm seeing consistent pyspark test failures on multiple PRs ([#26955|https://github.com/apache/spark/pull/26955], [#26201|https://github.com/apache/spark/pull/26201], [#27064|https://github.com/apache/spark/pull/27064]), and they all failed from "test_memory_limit".
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116422/testReport]
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116438/testReport]
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116429/testReport]
[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116366/testReport]
[jira] [Updated] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27686:
Attachment: hive-1.2.1-lib.tgz

> Update migration guide
> Key: SPARK-27686
> URL: https://issues.apache.org/jira/browse/SPARK-27686
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, SQL
> Affects Versions: 3.0.0
> Reporter: Yuming Wang
> Priority: Minor
> Attachments: hive-1.2.1-lib.tgz
>
> The built-in Hive 2.3 fixes the following issues:
> * HIVE-6727: Table level stats for external tables are set incorrectly.
> * HIVE-15653: Some ALTER TABLE commands drop table stats.
> * SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
> * SPARK-25193: insert overwrite doesn't throw exception when drop old data fails.
> * SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
> * SPARK-26332: Spark sql write orc table on viewFS throws exception.
> * SPARK-26437: Decimal data becomes bigint to query, unable to query.
> We need to update the migration guide.
[jira] [Created] (SPARK-30479) Apply compaction of event log to SQL events
Jungtaek Lim created SPARK-30479:
Summary: Apply compaction of event log to SQL events
Key: SPARK-30479
URL: https://issues.apache.org/jira/browse/SPARK-30479
Project: Spark
Issue Type: Sub-task
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim

This issue tracks the effort to compact old event logs (and clean up after compaction) without breaking the compatibility guarantee. It depends on SPARK-29779 and focuses on dealing with SQL events.
[jira] [Updated] (SPARK-29779) Compact old event log files and clean up
[ https://issues.apache.org/jira/browse/SPARK-29779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29779:
Description:
This issue tracks the effort to compact old event log files (and clean up after compaction) without breaking the compatibility guarantee.
Please note that this issue leaves the functionality below to future JIRA issues, as the patch for SPARK-29779 was too large and we decided to break it down:
* apply filter in SQL events
* integrate compaction into FsHistoryProvider
* documentation about the new configuration

was: This issue is to track the effort on compacting old event logs (and cleaning up after compaction) without breaking the compatibility guarantee.
[jira] [Updated] (SPARK-30477) More KeyValueGroupedDataset methods should be composable
[ https://issues.apache.org/jira/browse/SPARK-30477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Jones updated SPARK-30477:
Description:
Right now many `KeyValueGroupedDataset` methods do not return a `KeyValueGroupedDataset`. In some cases this means we have to do multiple `groupByKey`s in order to express certain patterns.

Setup
{code:scala}
def f: T => K
def g: U => K
def h: V => K

val ds1: Dataset[T] = ???
val ds2: Dataset[U] = ???
val ds3: Dataset[V] = ???

val kvDs1: KeyValueGroupedDataset[K, T] = ds1.groupByKey(f)
val kvDs2: KeyValueGroupedDataset[K, U] = ds2.groupByKey(g)
val kvDs3: KeyValueGroupedDataset[K, V] = ds3.groupByKey(h)
{code}

Example one: combining multiple cogrouped Datasets.
{code:scala}
// Current
kvDs1
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)
  .groupByKey((x: X) => ???: K)
  .cogroup(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[Y]) => ???: Z)

// Wanted
trait KeyValueGroupedDataset[K, T] {
  def coGroupKeyValueGroupedDataset[U, X](r: KeyValueGroupedDataset[K, U])(f: (K, Iterator[T], Iterator[U]) => X): KeyValueGroupedDataset[K, X]
}

kvDs1
  .coGroupKeyValueGroupedDataset(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)
  .coGroupKeyValueGroupedDataset(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[Y]) => ???: Z)
{code}

Example two: combining a reduceGroups with a cogroup.
{code:scala}
// Current
val newDs1: Dataset[X] = kvDs1
  .reduceGroups((l: T, r: T) => ???: T)
  .groupByKey { case (k, _) => k }
  .mapValues { case (_, v) => v }
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)

// Wanted
trait KeyValueGroupedDataset[K, V] {
  def reduceGroupsKeyValueGroupedDataset(f: (V, V) => V): KeyValueGroupedDataset[K, V]
}

val newDs2: Dataset[X] = kvDs1
  .reduceGroupsKeyValueGroupedDataset((l: T, r: T) => ???: T)
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: X)
{code}

In both cases not only are the ergonomics better, Spark will be better able to optimize the code.

For almost every method of `KeyValueGroupedDataset` we should have a matching method that returns a `KeyValueGroupedDataset`. We can also add a `.toDs` method which converts a `KeyValueGroupedDataset[K, V]` to a `Dataset[(K, V)]`.

was: (the same description, differing only in formatting)

> More KeyValueGroupedDataset methods should be composable
> Key: SPARK-30477
> URL: https://issues.apache.org/jira/browse/SPARK-30477
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Paul Jones
> Priority: Major
>
> Right now many `KeyValueGroupedDataset` methods do not return a `KeyValueGroupedDataset`. In some cases this means we have to do multiple `groupByKey`s in order to express certain patterns.
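The composability being requested can be illustrated with a toy keyed collection in Python. The names are hypothetical and purely show the chaining idea; this is not the proposed Spark API.

```python
# Toy keyed dataset whose grouped operations return another keyed dataset,
# so cogroup/reduce steps chain without an extra re-grouping pass.

from collections import defaultdict
from functools import reduce

class Keyed:
    def __init__(self, pairs):
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(v)
        self.groups = dict(groups)

    def cogroup_keyed(self, other, f):
        # analogous to the proposed coGroupKeyValueGroupedDataset
        keys = set(self.groups) | set(other.groups)
        return Keyed([(k, f(k, self.groups.get(k, []), other.groups.get(k, [])))
                      for k in keys])

    def reduce_groups_keyed(self, f):
        # analogous to the proposed reduceGroupsKeyValueGroupedDataset
        return Keyed([(k, reduce(f, vs)) for k, vs in self.groups.items()])

    def to_pairs(self):
        return sorted(self.groups.items())

summed = Keyed([("x", 1), ("x", 2), ("y", 3)]).reduce_groups_keyed(lambda l, r: l + r)
joined = summed.cogroup_keyed(Keyed([("x", 10)]),
                              lambda k, l, r: sum(l) + sum(r))
```

Because each step hands back a keyed structure, the reduce-then-cogroup pipeline needs no intermediate `groupByKey`, which is exactly the ergonomic (and optimizer-friendly) win the issue describes.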
[jira] [Updated] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27686:
Description:
The built-in Hive 2.3 fixes the following issues:
* HIVE-6727: Table level stats for external tables are set incorrectly.
* HIVE-15653: Some ALTER TABLE commands drop table stats.
* SPARK-12014: Spark SQL query containing semicolon is broken in Beeline.
* SPARK-25193: insert overwrite doesn't throw exception when drop old data fails.
* SPARK-25919: Date value corrupts when tables are "ParquetHiveSerDe" formatted and target table is Partitioned.
* SPARK-26332: Spark sql write orc table on viewFS throws exception.
* SPARK-26437: Decimal data becomes bigint to query, unable to query.
We need to update the migration guide.

was: the same description, plus the sentence "Please note that this is only fixed in `hadoop-3.2` binary distribution."
[jira] [Updated] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27686:
Parent Issue: SPARK-30034 (was: SPARK-23710)
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474:
Description:
In the current Spark implementation, if you set
{code:java}
spark.sql.sources.partitionOverwriteMode=dynamic
{code}
even with
{code:java}
mapreduce.fileoutputcommitter.algorithm.version=2
{code}
it still renames the partition folders *sequentially* in the commitJob stage, as shown here:
[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188]
[https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184]
This is very slow on cloud storage. Should we commit the data similarly to FileOutputCommitter v2?

was: (the same description)

> Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage
> Key: SPARK-30474
> URL: https://issues.apache.org/jira/browse/SPARK-30474
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Affects Versions: 2.3.4, 2.4.4
> Reporter: Zaisheng Dai
> Priority: Minor
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474:
Summary: Writing data to parquet with dynamic partitionOverwriteMode should not do the folder rename in commitjob stage (was: Writing data to parquet with dynamic partition should not be done in commit job stage)
[jira] [Commented] (SPARK-27686) Update migration guide
[ https://issues.apache.org/jira/browse/SPARK-27686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012434#comment-17012434 ] Dongjoon Hyun commented on SPARK-27686:
Hi, [~yumwang]. Can we have this document?
[jira] [Commented] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012432#comment-17012432 ] Dongjoon Hyun commented on SPARK-30441: --- Hi, [~jmzhou]. Please don't set `Fixed Version`. We use that when the committers merge the PRs. - https://spark.apache.org/contributing.html Also, `New Feature` and `Improvement` should have the version of the `master` branch because the Apache Spark community backports only bug fixes. > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 3.0.0 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
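The unpersist pattern the ticket asks for can be sketched abstractly. The toy Python below is not GraphX code — `ToyCache` is an invented stand-in for Spark's block manager — but it shows the bookkeeping: each iteration's working graph is cached, and the previous one is released as soon as its successor is materialized, so at most one intermediate stays resident instead of all of them.

```python
class ToyCache:
    """Stand-in for Spark's block manager: tracks which intermediate
    results are currently held in memory. (Invented for illustration.)"""
    def __init__(self):
        self.live = set()

    def persist(self, name):
        self.live.add(name)

    def unpersist(self, name):
        self.live.discard(name)

cache = ToyCache()
prev = None
for i in range(5):
    cur = f"sccWorkGraph_{i}"
    cache.persist(cur)         # materialize this iteration's working graph
    if prev is not None:
        cache.unpersist(prev)  # release the previous one promptly
    prev = cur
# After the loop only the latest intermediate remains cached.
```

Without the `unpersist` call, `cache.live` would grow by one entry per iteration — the growth pattern figure1.png reportedly shows for the cached `sccWorkGraph` lineage.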
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Target Version/s: (was: 3.0.0) > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 2.1.0, 2.3.0, 2.4.0, 2.4.4 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Flags: (was: Important) > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 2.1.0, 2.3.0, 2.4.0, 2.4.4 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Affects Version/s: (was: 2.4.4) (was: 2.4.0) (was: 2.3.0) (was: 2.1.0) 3.0.0 > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 3.0.0 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30441) Improve the memory usage in StronglyConnectedComponents
[ https://issues.apache.org/jira/browse/SPARK-30441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30441: -- Fix Version/s: (was: 3.0.0) > Improve the memory usage in StronglyConnectedComponents > --- > > Key: SPARK-30441 > URL: https://issues.apache.org/jira/browse/SPARK-30441 > Project: Spark > Issue Type: Improvement > Components: GraphX > Affects Versions: 2.1.0, 2.3.0, 2.4.0, 2.4.4 > Reporter: jiamuzhou > Priority: Major > Attachments: figure1.png, figure2.png > > > StronglyConnectedComponents consumes a lot of memory (see figure1.png), because the Graph/RDD is not marked as non-persistent in a timely manner during the iterative process, which may lead to failures on large graphs. > In order to improve memory usage, it is very important to mark the Graph/RDD as non-persistent promptly. The current code marks only 'sccGraph' as non-persistent, but not 'sccWorkGraph' in the degree and Pregel steps. > I have prepared an optimized code proposal (see my fork: [https://github.com/jmzhoulab/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala]) > The storage after optimization is shown in figure2.png -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012430#comment-17012430 ] Dongjoon Hyun commented on SPARK-30296: --- Hi, [~EnricoMi]. Please don't set `Fixed Version`. We set that when the committers merge the PRs. Also, `New Feature` should have the version of the `master` branch, 3.0.0 (as of today), because the Apache Spark community has a policy which allows backporting bug-fixes only. - https://spark.apache.org/contributing.html > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.0.0 > Reporter: Enrico Minack > Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still produces identical results, or if changes are expected, to investigate those changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes laborious and error-prone. With a single proven and tested method, diffing becomes easier and a more reliable operation. As a Dataset transformation, you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed production code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30296: -- Affects Version/s: (was: 2.4.4) 3.0.0 > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 3.0.0 > Reporter: Enrico Minack > Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still produces identical results, or if changes are expected, to investigate those changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes laborious and error-prone. With a single proven and tested method, diffing becomes easier and a more reliable operation. As a Dataset transformation, you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed production code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30296: -- Fix Version/s: (was: 3.0.0) > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.4.4 > Reporter: Enrico Minack > Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still produces identical results, or if changes are expected, to investigate those changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes laborious and error-prone. With a single proven and tested method, diffing becomes easier and a more reliable operation. As a Dataset transformation, you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed production code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25017) Add test suite for ContextBarrierState
[ https://issues.apache.org/jira/browse/SPARK-25017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25017: -- Target Version/s: (was: 3.0.0) > Add test suite for ContextBarrierState > -- > > Key: SPARK-25017 > URL: https://issues.apache.org/jira/browse/SPARK-25017 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Xingbo Jiang >Priority: Major > > We shall be able to add unit test to ContextBarrierState with a mocked > RpcCallContext. Currently it's only covered by end-to-end test in > `BarrierTaskContextSuite` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30478) update memory package doc
SongXun created SPARK-30478: --- Summary: update memory package doc Key: SPARK-30478 URL: https://issues.apache.org/jira/browse/SPARK-30478 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: SongXun Since Spark 2.0, storage memory can also use off-heap memory. The package doc should be updated accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30131) Add array_median function
[ https://issues.apache.org/jira/browse/SPARK-30131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30131: -- Fix Version/s: (was: 3.0.0) > Add array_median function > - > > Key: SPARK-30131 > URL: https://issues.apache.org/jira/browse/SPARK-30131 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.4.4 > Reporter: Alexander Hagerf > Priority: Minor > > It is known that there isn't any exact median function in Spark SQL, and this might be a difficult problem to solve efficiently. However, finding the median of an array should be a simple task, and something that users can utilize when collecting numeric values to a list or set. > This can already be achieved by sorting and choosing an element, but that can get cumbersome, and if a fully tested function is provided in the API, I think it can save some headaches for many. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
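The semantics the ticket asks for — sort the array and pick the middle element(s) — can be sketched in a few lines. The Python below is only an illustration of the proposed behavior: `array_median` is not an actual Spark function, and the even-length averaging rule shown here is one plausible choice, not something the ticket specifies.

```python
def array_median(xs):
    """Median of a list of numbers: sort, then pick the middle element,
    averaging the two middle elements for even-length input.
    (Sketch of the proposed array_median semantics; hypothetical API.)"""
    if not xs:
        return None  # empty array -> NULL in SQL terms
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(array_median([3.0, 1.0, 2.0]))       # 2.0
print(array_median([4.0, 1.0, 3.0, 2.0]))  # 2.5
```

This is exactly the "sorting and choosing an element" workaround the description mentions, packaged as a single function — the point of the proposal is that a built-in, tested version would spare every user from rewriting it.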
[jira] [Updated] (SPARK-30131) Add array_median function
[ https://issues.apache.org/jira/browse/SPARK-30131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30131: -- Target Version/s: (was: 2.4.4) > Add array_median function > - > > Key: SPARK-30131 > URL: https://issues.apache.org/jira/browse/SPARK-30131 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 2.4.4 > Reporter: Alexander Hagerf > Priority: Minor > Fix For: 3.0.0 > > > It is known that there isn't any exact median function in Spark SQL, and this might be a difficult problem to solve efficiently. However, finding the median of an array should be a simple task, and something that users can utilize when collecting numeric values to a list or set. > This can already be achieved by sorting and choosing an element, but that can get cumbersome, and if a fully tested function is provided in the API, I think it can save some headaches for many. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30034) Use Apache Hive 2.3 dependency by default
[ https://issues.apache.org/jira/browse/SPARK-30034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30034. --- Fix Version/s: 3.0.0 Resolution: Done > Use Apache Hive 2.3 dependency by default > - > > Key: SPARK-30034 > URL: https://issues.apache.org/jira/browse/SPARK-30034 > Project: Spark > Issue Type: Umbrella > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29988. --- Fix Version/s: 3.0.0 Resolution: Fixed Thank you. It looks like it's working. I'll monitor them. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra > Affects Versions: 3.0.0 > Reporter: Dongjoon Hyun > Assignee: Shane Knapp > Priority: Major > Fix For: 3.0.0 > > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins jobs manually. (This should be added to the SCM of the AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30477) More KeyValueGroupedDataset methods should be composable
Paul Jones created SPARK-30477: -- Summary: More KeyValueGroupedDataset methods should be composable Key: SPARK-30477 URL: https://issues.apache.org/jira/browse/SPARK-30477 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: Paul Jones Right now many `KeyValueGroupedDataset` methods do not return a `KeyValueGroupedDataset`. In some cases this means we have to do multiple `groupByKey`s in order to express certain patterns. Setup:
{code:scala}
def f: T => K
def g: U => K
def h: V => K

val ds1: Dataset[T] = ???
val ds2: Dataset[U] = ???
val ds3: Dataset[V] = ???

val kvDs1: KeyValueGroupedDataset[K, T] = ds1.groupByKey(f)
val kvDs2: KeyValueGroupedDataset[K, U] = ds2.groupByKey(g)
val kvDs3: KeyValueGroupedDataset[K, V] = ds3.groupByKey(h)
{code}
Example one: combining multiple cogrouped Datasets.
{code:scala}
// Current: each cogroup returns a Dataset, so we must re-group before cogrouping again
kvDs1
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])
  .groupByKey((x: X) => ???: K)
  .cogroup(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[V]) => ???: Iterator[Z])

// Wanted
trait KeyValueGroupedDataset[K, T] {
  def coGroupKeyValueGroupedDataset[U, X](
      other: KeyValueGroupedDataset[K, U])(
      f: (K, Iterator[T], Iterator[U]) => TraversableOnce[X]): KeyValueGroupedDataset[K, X]
}

kvDs1
  .coGroupKeyValueGroupedDataset(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])
  .coGroupKeyValueGroupedDataset(kvDs3)((k: K, it1: Iterator[X], it2: Iterator[V]) => ???: Iterator[Z])
{code}
Example two: combining a reduceGroups with a cogroup.
{code:scala}
// Current: reduceGroups returns a Dataset[(K, T)], forcing another groupByKey
val newDs1: Dataset[X] = kvDs1
  .reduceGroups((l: T, r: T) => ???: T)
  .groupByKey { case (k, _) => k }
  .mapValues { case (_, v) => v }
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])

// Wanted
trait KeyValueGroupedDataset[K, T] {
  def reduceGroupsKeyValueGroupedDataset(f: (T, T) => T): KeyValueGroupedDataset[K, T]
}

val newDs2: Dataset[X] = kvDs1
  .reduceGroupsKeyValueGroupedDataset((l: T, r: T) => ???: T)
  .cogroup(kvDs2)((k: K, it1: Iterator[T], it2: Iterator[U]) => ???: Iterator[X])
{code}
In both cases not only are the ergonomics better, Spark will be better able to optimize the code. For almost every method of `KeyValueGroupedDataset` we should have a matching method that returns a `KeyValueGroupedDataset`. We can also add a `.toDs` method which converts a `KeyValueGroupedDataset[K, V]` to a `Dataset[(K, V)]`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
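The composability argument can be seen in a language-neutral toy model. In the Python sketch below (illustration only, not the Spark API — `group_by_key` and `cogroup` are invented helpers), a cogroup whose result is itself keyed composes directly with another cogroup, with no intervening re-grouping, which is exactly what `KeyValueGroupedDataset`-returning methods would enable.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Toy groupByKey: list of (k, v) pairs -> dict of k -> [v, ...]."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def cogroup(left, right):
    """Toy cogroup of two grouped maps. The result is itself keyed,
    so a further cogroup composes directly -- no re-grouping needed,
    which is the composability the ticket asks for."""
    return {k: (left.get(k, []), right.get(k, []))
            for k in set(left) | set(right)}

a = group_by_key([("x", 1), ("y", 2)])
b = group_by_key([("x", 10)])
ab = cogroup(a, b)   # still keyed: could feed another cogroup directly
```

In Spark the distinction matters more than in this toy: re-grouping a `Dataset` discards the knowledge that the data is already partitioned by key, so a keyed result type also lets the optimizer skip a shuffle.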
[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012381#comment-17012381 ] Gengliang Wang commented on SPARK-28396: [~jerrychenhf] they are still handled by V1 implementation > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30476) NullPointException when Insert data to hive mongo external table by spark-sql
XiongCheng created SPARK-30476: -- Summary: NullPointerException when inserting data into a Hive Mongo external table via spark-sql Key: SPARK-30476 URL: https://issues.apache.org/jira/browse/SPARK-30476 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.3 Environment: mongo-hadoop: 2.0.2 spark-version: 2.4.3 scala-version: 2.11 hive-version: 1.2.1 hadoop-version: 2.6.0 Reporter: XiongCheng I executed the following SQL, but got an NPE. result_data_mongo is a MongoDB Hive external table.
{code:java}
insert into result_data_mongo values("15","15","15",15,"15",15,15,15,15,15,15,15,15,15,15);
{code}
NPE detail:
{code:java}
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.NullPointerException
 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
 at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)
 at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
 at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
 at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
 at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:121)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at com.mongodb.hadoop.output.MongoOutputCommitter.getTaskAttemptPath(MongoOutputCommitter.java:264)
 at com.mongodb.hadoop.output.MongoRecordWriter.<init>(MongoRecordWriter.java:59)
 at com.mongodb.hadoop.hive.output.HiveMongoOutputFormat$HiveMongoRecordWriter.<init>(HiveMongoOutputFormat.java:80)
 at com.mongodb.hadoop.hive.output.HiveMongoOutputFormat.getHiveRecordWriter(HiveMongoOutputFormat.java:52)
 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
 at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
 ... 15 more
{code}
I know mongo-hadoop uses the incorrect key to get the TaskAttemptID, so I modified the source code of mongo-hadoop to read the correct properties ('mapreduce.task.id' and 'mapreduce.task.attempt.id'), but I still can't get the values. I found that these parameters are stored in Spark's TaskAttemptContext, but the TaskAttemptContext is not passed into HiveOutputWriter. Is this a design flaw? Here are the two key points: mongo-hadoop: [https://github.com/mongodb/mongo-hadoop/blob/cdcd0f15503f2d1c5a1a2e3941711d850d1e427b/hive/src/main/java/com/mongodb/hadoop/hive/output/HiveMongoOutputFormat.java#L80] spark-hive: [https://github.com/apache/spark/blob/7c7d7f6a878b02ece881266ee538f3e1443aa8c1/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala#L103] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30439) support NOT NULL in column data type
[ https://issues.apache.org/jira/browse/SPARK-30439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30439. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27110 [https://github.com/apache/spark/pull/27110] > support NOT NULL in column data type > > > Key: SPARK-30439 > URL: https://issues.apache.org/jira/browse/SPARK-30439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30416) Log a warning for deprecated SQL config in `set()` and `unset()`
[ https://issues.apache.org/jira/browse/SPARK-30416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30416. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27092 [https://github.com/apache/spark/pull/27092] > Log a warning for deprecated SQL config in `set()` and `unset()` > > > Key: SPARK-30416 > URL: https://issues.apache.org/jira/browse/SPARK-30416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > - Gather deprecated SQL configs and add extra info - when a config was > deprecated and why > - Output warning about deprecated SQL config in set() and unset() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
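The behavior this ticket adds can be sketched as follows. The Python below is a toy model only: the deprecation table and the config key in it are invented for illustration, and the real change lives in Spark's SQL config handling, not in code like this.

```python
import warnings

# Hypothetical deprecation registry: key -> (deprecated since, reason).
# The entry below is made up for illustration.
DEPRECATED_SQL_CONFIGS = {
    "spark.sql.example.legacyFlag": ("3.0.0", "superseded by a newer option"),
}

def set_conf(conf, key, value):
    """Set a config entry, emitting a warning when the key is deprecated --
    the behaviour the ticket adds to set()/unset(). (Sketch only; not the
    actual SQLConf implementation.)"""
    if key in DEPRECATED_SQL_CONFIGS:
        since, reason = DEPRECATED_SQL_CONFIGS[key]
        warnings.warn(
            f"SQL config '{key}' is deprecated since Spark {since}: {reason}")
    conf[key] = value  # the setting still takes effect; we only warn

conf = {}
set_conf(conf, "spark.sql.example.legacyFlag", "true")
```

The design point mirrors the ticket's two bullets: the registry carries "when and why" alongside each deprecated key, and the warning fires in both `set()` and (symmetrically) an `unset()` helper, while leaving behavior otherwise unchanged.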
[jira] [Assigned] (SPARK-30416) Log a warning for deprecated SQL config in `set()` and `unset()`
[ https://issues.apache.org/jira/browse/SPARK-30416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30416: Assignee: Maxim Gekk > Log a warning for deprecated SQL config in `set()` and `unset()` > > > Key: SPARK-30416 > URL: https://issues.apache.org/jira/browse/SPARK-30416 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > - Gather deprecated SQL configs and add extra info - when a config was > deprecated and why > - Output warning about deprecated SQL config in set() and unset() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30468: - Description: Currently data columns are displayed on one line by the show create table command; when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. To improve readability, we could print each column on a separate line. Note that other systems like Hive/MySQL also display columns this way. Also, for data columns, table properties and options, we'd better put the right parenthesis at the end of the last column/property/option instead of on a separate line. As a result, before the change: {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet OPTIONS ( `bar` '2', `foo` '1' ) TBLPROPERTIES ( 'a' = 'x', 'b' = 'y' ) {noformat} after the change: {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet OPTIONS ( `bar` '2', `foo` '1') TBLPROPERTIES ( 'a' = 'x', 'b' = 'y') {noformat} was: Currently data columns are displayed in one line for show create table command, when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. E.g. 
{noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} > Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns (to make things even worse, columns > may have long names or comments), the displayed result is really hard to read. > To improve readability, we could print each column in a separate line. Note > that other systems like Hive/MySQL also display in this way. > Also, for data columns, table properties and options, we'd better put the > right parenthesis to the end of the last column/property/option, instead of > occupying a separate line. 
> As a result, before the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1' > ) > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y' > ) > {noformat} > after the change: > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > OPTIONS ( > `bar` '2', > `foo` '1') > TBLPROPERTIES ( > 'a' = 'x', > 'b' = 'y') > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
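The proposed layout is straightforward to prototype. The toy Python formatter below is an illustration only — not Spark's actual SHOW CREATE TABLE code — and shows the two points of the ticket: one column per line, with the closing parenthesis attached to the last column rather than given its own line.

```python
def render_columns(cols):
    """Render column definitions one per line, attaching the closing
    parenthesis to the last column instead of putting it on its own line,
    as the ticket proposes. (Toy formatter, not Spark's implementation.)"""
    body = ",\n".join(f"  `{name}` {typ} COMMENT '{comment}'"
                      for name, typ, comment in cols)
    return f"CREATE TABLE `test_table` (\n{body})\nUSING parquet"

ddl = render_columns([
    ("col1", "INT", "This is comment for column 1"),
    ("col2", "STRING", "This is comment for column 2"),
])
print(ddl)
```

The same join-then-append-parenthesis trick applies unchanged to the OPTIONS and TBLPROPERTIES sections shown in the before/after examples.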
[jira] [Commented] (SPARK-28396) Add PathCatalog for data source V2
[ https://issues.apache.org/jira/browse/SPARK-28396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012336#comment-17012336 ] Haifeng Chen commented on SPARK-28396: -- [~Gengliang.Wang] Gengliang, I am trying to understand how Hive catalog tables are connected to the data source V2 API in the current implementation. Just to check with you: in the current Spark 3.0 implementation, have Hive catalog tables or Thrift server catalog tables already gone through the data source V2 implementation, or are they still handled by the V1 implementation? > Add PathCatalog for data source V2 > -- > > Key: SPARK-28396 > URL: https://issues.apache.org/jira/browse/SPARK-28396 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > Add PathCatalog for data source V2, so that: > 1. We can convert SaveMode in DataFrameWriter into catalog table operations, > instead of supporting SaveMode in file source V2. > 2. Support create-table SQL statements like "CREATE TABLE orc.'path'" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance
[ https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-24714. -- Resolution: Won't Fix > AnalysisSuite should use ClassTag to check the runtime instance > --- > > Key: SPARK-24714 > URL: https://issues.apache.org/jira/browse/SPARK-24714 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chia-Ping Tsai >Priority: Minor > >
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
>   def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: Expression*): Unit = {
>     val partitioning = RepartitionByExpression(exprs, testRelation2, numPartitions).partitioning
>     assert(partitioning.isInstanceOf[T]) // always true because of type erasure
>   }
> {code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to > correct the type check. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24714) AnalysisSuite should use ClassTag to check the runtime instance
[ https://issues.apache.org/jira/browse/SPARK-24714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012323#comment-17012323 ] Takeshi Yamamuro commented on SPARK-24714: -- I'll close this because the corresponding PR is inactive. If necessary, please reopen it. > AnalysisSuite should use ClassTag to check the runtime instance > --- > > Key: SPARK-24714 > URL: https://issues.apache.org/jira/browse/SPARK-24714 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.1 >Reporter: Chia-Ping Tsai >Priority: Minor > >
> {code:java}
> test("SPARK-22614 RepartitionByExpression partitioning") {
>   def checkPartitioning[T <: Partitioning](numPartitions: Int, exprs: Expression*): Unit = {
>     val partitioning = RepartitionByExpression(exprs, testRelation2, numPartitions).partitioning
>     assert(partitioning.isInstanceOf[T]) // always true because of type erasure
>   }
> {code}
> Spark supports Scala 2.10 and 2.11, so it is OK to introduce ClassTag to > correct the type check. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
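The erasure problem the suite suffers from, and the ClassTag fix this ticket proposes, can be demonstrated in plain Scala with no Spark dependency (the names below are illustrative, not from the Spark codebase):

```scala
import scala.reflect.ClassTag

object ErasureDemo {
  // T is erased at runtime, so this degenerates to isInstanceOf[Object]
  // and succeeds for any non-null value; the compiler emits an
  // "unchecked" warning here. This is why the suite's assertion can
  // never fail.
  def erasedCheck[T](x: Any): Boolean = x.isInstanceOf[T]

  // A ClassTag carries the runtime class, so the check is real.
  def taggedCheck[T](x: Any)(implicit tag: ClassTag[T]): Boolean =
    tag.runtimeClass.isInstance(x)
}
```

`erasedCheck[String](42)` returns true even though 42 is not a String, while `taggedCheck[String](42)` correctly returns false; rewriting `checkPartitioning` with a `ClassTag[T]` context bound is the analogous fix.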
[jira] [Updated] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guy Khazma updated SPARK-30475: --- Description: Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which added support for partition pruning in File source V2. We should also pass the {{dataFilters}} to the {{listFiles}} method. Datasources such as {{csv}} and {{json}} do not implement the {{SupportsPushDownFilters}} trait. In order to support data skipping uniformly for all file-based data sources, one can override the {{listFiles}} method in a {{FileIndex}} implementation and use the {{dataFilters}} and {{partitionFilters}} to consult external metadata and prune the list of files. was: Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which added support for partition pruning in File source V2. We should also pass the {{dataFilters}} to the {{listFiles method.}} Datasources such as {{csv}} and {{json}} do not implement the {{SupportsPushDownFilters}} trait. In order to support data skipping uniformly for all file based data sources, one can override the {{listFiles}} method in a {{FileIndex}} implementation, which consults external metadata and prunes the list of files. > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles}} method. > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file-based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation and use the {{dataFilters}} and > {{partitionFilters}} to consult external metadata and prune the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
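A Spark-free sketch of the data-skipping idea described above. The `listFiles` name follows the ticket, but everything else here is hypothetical: the toy models the external metadata as per-file min/max statistics for a single column, and a pushed-down data filter as a `[lo, hi]` range.

```scala
// Hypothetical per-file metadata, e.g. min/max statistics for one column.
final case class FileStats(path: String, min: Int, max: Int)

object SkippingFileIndex {
  // Keep only files whose [min, max] range can intersect the filter's
  // [lo, hi] range; every other file is pruned before any read happens.
  def listFiles(files: Seq[FileStats], lo: Int, hi: Int): Seq[String] =
    files.filter(f => f.max >= lo && f.min <= hi).map(_.path)
}
```

The point of the ticket is that a `FileIndex` override gets both `dataFilters` and `partitionFilters` at listing time, so this pruning can happen uniformly even for sources like csv/json that do not implement `SupportsPushDownFilters`.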
[jira] [Commented] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012298#comment-17012298 ] Guy Khazma commented on SPARK-30475: PR https://github.com/apache/spark/pull/27157 > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles method.}} > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation, which consults external metadata > and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guy Khazma updated SPARK-30475: --- External issue URL: https://github.com/apache/spark/pull/27157 > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles method.}} > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation, which consults external metadata > and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30475) File source V2: Push data filters for file listing
[ https://issues.apache.org/jira/browse/SPARK-30475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guy Khazma updated SPARK-30475: --- External issue URL: (was: https://github.com/apache/spark/pull/27157) > File source V2: Push data filters for file listing > -- > > Key: SPARK-30475 > URL: https://issues.apache.org/jira/browse/SPARK-30475 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Guy Khazma >Priority: Major > Fix For: 3.0.0 > > > Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which > added support for partition pruning in File source V2. > We should also pass the {{dataFilters}} to the {{listFiles method.}} > Datasources such as {{csv}} and {{json}} do not implement the > {{SupportsPushDownFilters}} trait. In order to support data skipping > uniformly for all file based data sources, one can override the {{listFiles}} > method in a {{FileIndex}} implementation, which consults external metadata > and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30475) File source V2: Push data filters for file listing
Guy Khazma created SPARK-30475: -- Summary: File source V2: Push data filters for file listing Key: SPARK-30475 URL: https://issues.apache.org/jira/browse/SPARK-30475 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Guy Khazma Fix For: 3.0.0 Follow up on [SPARK-30428|https://github.com/apache/spark/pull/27112] which added support for partition pruning in File source V2. We should also pass the {{dataFilters}} to the {{listFiles}} method. Datasources such as {{csv}} and {{json}} do not implement the {{SupportsPushDownFilters}} trait. In order to support data skipping uniformly for all file-based data sources, one can override the {{listFiles}} method in a {{FileIndex}} implementation, which consults external metadata and prunes the list of files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp updated SPARK-29988: Attachment: Screen Shot 2020-01-09 at 1.59.25 PM.png > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012261#comment-17012261 ] Shane Knapp commented on SPARK-29988: - it's hard to tell but i disabled the old jobs and all the new ones are running. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012260#comment-17012260 ] Shane Knapp commented on SPARK-29988: - done! !Screen Shot 2020-01-09 at 1.59.25 PM.png! > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > Attachments: Screen Shot 2020-01-09 at 1.59.25 PM.png > > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination
[ https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012246#comment-17012246 ] Shane Knapp commented on SPARK-29988: - ok, after banging my head against jenkins job builder, i finally got it to work. deploying now. > Adjust Jenkins jobs for `hive-1.2/2.3` combination > -- > > Key: SPARK-29988 > URL: https://issues.apache.org/jira/browse/SPARK-29988 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > We need to rename the following Jenkins jobs first. > spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2 > spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3 > spark-master-test-maven-hadoop-2.7 -> > spark-master-test-maven-hadoop-2.7-hive-1.2 > spark-master-test-maven-hadoop-3.2 -> > spark-master-test-maven-hadoop-3.2-hive-2.3 > Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs. > {code} > -Phive \ > +-Phive-1.2 \ > {code} > And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs. > {code} > -Phive \ > +-Phive-2.3 \ > {code} > For now, I added the above `-Phive-1.2` and `-Phive-2.3` to Jenkins > manually. (This should be added to the SCM of AmpLab Jenkins.) > After SPARK-29981, we need to create two new jobs. > - spark-master-test-sbt-hadoop-2.7-hive-2.3 > - spark-master-test-maven-hadoop-2.7-hive-2.3 > This is in preparation for Apache Spark 3.0.0. > We may drop all `*-hive-1.2` jobs in Apache Spark 3.1.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012229#comment-17012229 ] Everett Rush commented on SPARK-27249: -- [~nafshartous] Hi Nick, I would like to have a "MultiColumnTransformer" class in Spark, which I should be able to subclass. I would like the API to be similar to UnaryTransformer: I provide a transformation function and a new schema, and then Spark handles the encoding back to a DataFrame and optimizes the computation however it can.
{code:java}
class ExampleMulticolumn(override val uid: String, envVars: Map[String, String])
  extends MultiColumnTransformer[ExampleMulticolumn] with HasInputCol with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("exampleMulticolumn"), Map())

  // developer provides the new schema for the DataFrame
  val newSchema: StructType

  override protected def transformFunc: Iterator[Row] => Iterator[Row] = {
    iter => {
      // connect to the database
      // iterate over the rows in the partition
      val new_iter = iter.map { row =>
        // do some computation
        row
      }
      new_iter
    }
  }

  override def copy(extra: ParamMap): ExampleMulticolumn = defaultCopy(extra)
}
{code}
> Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize repeatedly in a UDF, such as a database connection. 
> > Design: > An abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row]. > NB: This parallels the UnaryTransformer createTransformFunc method. > > When developers subclass this transformer, they can provide their own schema > for the output Row, in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively, the developer can set > the output DataType and output column name. Then the PartitionTransformer class will > create a new schema and a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
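The partition-level hook proposed above can be sketched without any Spark dependency. Everything here is illustrative: `Row` is a stand-in for Spark's `Row`, and `PartitionTransform` is not a real Spark class. The point is that a transform shaped as `Iterator[Row] => Iterator[Row]` lets per-partition resources (for example a database connection) be initialized once per partition rather than once per row.

```scala
// Stand-in for Spark's Row type; illustrative only.
final case class Row(values: Seq[Any])

object PartitionTransform {
  // Mirrors the proposed Iterator[Row] => Iterator[Row] shape: expensive
  // setup happens once, before the partition's rows are iterated.
  def transformFunc(perRow: Row => Row): Iterator[Row] => Iterator[Row] =
    iter => {
      // e.g. open a database connection here, once per partition
      iter.map(perRow)
    }
}
```

This parallels `UnaryTransformer.createTransformFunc`, except the function closes over the whole partition iterator instead of a single value.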
[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer
[ https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010756#comment-17010756 ] Nick Afshartous edited comment on SPARK-27249 at 1/9/20 7:49 PM: - I could try and look into this. Could someone validate that this feature is still needed? [~enrush] It would also be helpful if you could provide a code example illustrating how the {{PartitionTransformer}} would be used. was (Author: nafshartous): I could try and look into this. Could someone validate that this feature is still needed ? > Developers API for Transformers beyond UnaryTransformer > --- > > Key: SPARK-27249 > URL: https://issues.apache.org/jira/browse/SPARK-27249 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.0 >Reporter: Everett Rush >Priority: Minor > Labels: starter > Original Estimate: 96h > Remaining Estimate: 96h > > It would be nice to have a developers' API for dataset transformations that > need more than one column from a row (i.e. UnaryTransformer inputs one column > and outputs one column) or that contain objects too expensive to initialize > repeatedly in a UDF, such as a database connection. > > Design: > An abstract class PartitionTransformer extends Transformer and defines the > partition transformation function as Iterator[Row] => Iterator[Row]. > NB: This parallels the UnaryTransformer createTransformFunc method. > > When developers subclass this transformer, they can provide their own schema > for the output Row, in which case the PartitionTransformer creates a row > encoder and executes the transformation. Alternatively, the developer can set > the output DataType and output column name. Then the PartitionTransformer class will > create a new schema and a row encoder, and execute the transformation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30459) Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2
[ https://issues.apache.org/jira/browse/SPARK-30459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-30459. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/27136 > Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2 > - > > Key: SPARK-30459 > URL: https://issues.apache.org/jira/browse/SPARK-30459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to > DSv1: it stops immediately once it finds a missing or corrupt file, while > DSv1 skips it and continues to read the next files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30459) Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2
[ https://issues.apache.org/jira/browse/SPARK-30459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-30459: --- Issue Type: Bug (was: Improvement) > Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2 > - > > Key: SPARK-30459 > URL: https://issues.apache.org/jira/browse/SPARK-30459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to > DSv1: it stops immediately once it finds a missing or corrupt file, while > DSv1 skips it and continues to read the next files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30459) Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2
[ https://issues.apache.org/jira/browse/SPARK-30459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-30459: -- Assignee: wuyi > Fix ignoreMissingFiles/ignoreCorruptFiles in DSv2 > - > > Key: SPARK-30459 > URL: https://issues.apache.org/jira/browse/SPARK-30459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > ignoreMissingFiles/ignoreCorruptFiles in DSv2 behaves incorrectly compared to > DSv1: it stops immediately once it finds a missing or corrupt file, while > DSv1 skips it and continues to read the next files. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29219. - Fix Version/s: 3.0.0 Resolution: Done Resolved by [https://github.com/apache/spark/pull/26913] > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We currently don't support all save modes in DataFrameWriter.save, as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringOptions. This information can then be used to check the > existence of a table and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or the configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: StringMap): String
>   def extractIdentifier(options: StringMap): Identifier
> }
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29219) DataSourceV2: Support all SaveModes in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz reassigned SPARK-29219: --- Assignee: Burak Yavuz > DataSourceV2: Support all SaveModes in DataFrameWriter.save > --- > > Key: SPARK-29219 > URL: https://issues.apache.org/jira/browse/SPARK-29219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > > We currently don't support all save modes in DataFrameWriter.save, as the > TableProvider interface allows for the reading/writing of data, but not for > the creation of tables. We created a catalog API to support the > creation/dropping/checking existence of tables, but DataFrameWriter.save > doesn't necessarily use a catalog, for example when writing to a path-based > table. > For this case, we propose a new interface that will allow TableProviders to > extract an Identifier and a Catalog from a bundle of > CaseInsensitiveStringOptions. This information can then be used to check the > existence of a table and support all save modes. If a Catalog is not > defined, then the behavior is to use the spark_catalog (or the configured session > catalog) to perform the check. > > The interface can look like:
> {code:java}
> trait CatalogOptions {
>   def extractCatalog(options: StringMap): String
>   def extractIdentifier(options: StringMap): Identifier
> }
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
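A minimal sketch of the proposed interface. Assumptions are labeled: a plain `Map[String, String]` stands in for Spark's `CaseInsensitiveStringMap`, the identifier is modeled as a `String`, and `PathBasedOptions` is a hypothetical provider, not a Spark class; only the trait shape follows the ticket.

```scala
object CatalogOptionsSketch {
  // Stand-in for CaseInsensitiveStringMap (illustrative only).
  type StringMap = Map[String, String]

  // Trait shape follows the ticket; the identifier is simplified to String.
  trait CatalogOptions {
    def extractCatalog(options: StringMap): String
    def extractIdentifier(options: StringMap): String
  }

  // Hypothetical provider: when the options carry no explicit catalog,
  // fall back to the session catalog, as the ticket describes.
  object PathBasedOptions extends CatalogOptions {
    def extractCatalog(options: StringMap): String =
      options.getOrElse("catalog", "spark_catalog")
    def extractIdentifier(options: StringMap): String =
      options("path")
  }
}
```

With a catalog and identifier in hand, `DataFrameWriter.save` can check table existence first and therefore honor every `SaveMode`, even for path-based tables.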
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partition should not be done in commit job stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474: - Description: In the current Spark implementation, if you set
{code:java}
spark.sql.sources.partitionOverwriteMode=dynamic
{code}
even with
{code:java}
mapreduce.fileoutputcommitter.algorithm.version=2
{code}
it would still rename the partition folders *sequentially* in the commitJob stage, as shown here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] This is very slow on cloud storage. Should we commit the data the way FileOutputCommitter v2 does? was: In the current spark implementation if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the partition folder *sequentially* in commitJob stage as shown here: [|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] This is very slow in cloud storage. We should commit the data similar to FileOutputCommitter v2? 
> Writing data to parquet with dynamic partition should not be done in commit > job stage > - > > Key: SPARK-30474 > URL: https://issues.apache.org/jira/browse/SPARK-30474 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.4, 2.4.4 >Reporter: Zaisheng Dai >Priority: Minor > > In the current Spark implementation, if you set
> {code:java}
> spark.sql.sources.partitionOverwriteMode=dynamic
> {code}
> even with
> {code:java}
> mapreduce.fileoutputcommitter.algorithm.version=2
> {code}
> it would still rename the partition folders *sequentially* in the commitJob stage, > as shown here: > [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] > > [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] > > This is very slow on cloud storage. Should we commit the data the way > FileOutputCommitter v2 does? > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30474) Writing data to parquet with dynamic partition should not be done in commit job stage
[ https://issues.apache.org/jira/browse/SPARK-30474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zaisheng Dai updated SPARK-30474: - Description: In the current spark implementation if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the partition folder *sequentially* in commitJob stage as shown here: [|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] This is very slow in cloud storage. We should commit the data similar to FileOutputCommitter v2? was: In the current spark implementation if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the partition folder *sequentially* in commitJob stage as shown here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] This is very slow in cloud storage. We should commit the data similar to FileOutputCommitter v2? 
> Writing data to parquet with dynamic partition should not be done in commit > job stage > - > > Key: SPARK-30474 > URL: https://issues.apache.org/jira/browse/SPARK-30474 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.3.4, 2.4.4 >Reporter: Zaisheng Dai >Priority: Minor > > In the current spark implementation if you set > spark.sql.sources.partitionOverwriteMode=dynamic, even with > mapreduce.fileoutputcommitter.algorithm.version=2, it would still rename the > partition folder *sequentially* in commitJob stage as shown here: > [|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] > > [https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L184] > > This is very slow in cloud storage. We should commit the data similar to > FileOutputCommitter v2?
[jira] [Created] (SPARK-30474) Writing data to parquet with dynamic partition should not be done in commit job stage
Zaisheng Dai created SPARK-30474: Summary: Writing data to parquet with dynamic partition should not be done in commit job stage Key: SPARK-30474 URL: https://issues.apache.org/jira/browse/SPARK-30474 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 2.4.4, 2.3.4 Reporter: Zaisheng Dai In the current Spark implementation, if you set spark.sql.sources.partitionOverwriteMode=dynamic, even with mapreduce.fileoutputcommitter.algorithm.version=2, it still renames the partition folders *sequentially* in the commitJob stage, as shown here: [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L188] This is very slow on cloud storage. We should commit the data the same way FileOutputCommitter v2 does.
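The cost model behind the report can be sketched outside Spark. The following is a minimal Python sketch, not Spark's HadoopMapReduceCommitProtocol: it moves staged partition directories to their final locations on a thread pool, illustrating how a v2-style commit could overlap the per-partition renames whose round-trip latency dominates commitJob on cloud object stores. The `commit_partitions` helper and its signature are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def commit_partitions(pairs, max_workers=8):
    """Move each staged partition directory to its final location.

    pairs: list of (staging_path, final_path) tuples. Running the renames
    on a thread pool overlaps their latency instead of paying it once per
    partition sequentially, which is the slowness the issue describes.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # os.replace is an atomic rename on a local POSIX filesystem; on an
        # object store each "rename" is really a copy+delete, so the gain
        # from overlapping them grows with the number of partitions.
        list(pool.map(lambda p: os.replace(p[0], p[1]), pairs))
```

On local disk the difference is negligible; the sketch only shows the structure of a parallel commit, not a drop-in replacement for Spark's commit protocol.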
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Description: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create a SparkContext because communication between the Spark Worker and the Spark Master is not possible if we configure *spark.network.crypto.enabled true*. *Error logs:* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). *fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE* JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between Spark Worker and Spark Master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found.
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between Spark Worker and Spark Master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. 
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Blocker > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > Spark Worker and Spark Master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > *fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE* > JVMDU
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Priority: Blocker (was: Major) > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Blocker > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > Spark Worker and Spark Master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - > please wait. > JVMDUMP032I JVM requested System dump using > '/bin/core.20200109.064150.283.0001.dmp' in response to an event > JVMDUMP030W Cannot write dump to > file/bin/core.20200109.064150.283.0001.dmp: Permission denied > JVMDUMP012E Error in System dump: The core file created by child process with > pid = 375 was not found. 
Expected to find core file with name > "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" > JVMDUMP030W Cannot write dump to file > /bin/javacore.20200109.064150.283.0002.txt: Permission denied > JVMDUMP032I JVM requested Java dump using > '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event > JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt > JVMDUMP032I JVM requested Snap dump using > '/bin/Snap.20200109.064150.283.0003.trc' in response to an event > JVMDUMP030W Cannot write dump to file > /bin/Snap.20200109.064150.283.0003.trc: Permission denied > JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc > JVMDUMP030W Cannot write dump to file > /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied > JVMDUMP007I JVM Requesting JIT dump using > '/tmp/jitdump.20200109.064150.283.0004.dmp' > JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp > JVMDUMP013I Processed dump event "abort", detail "".
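For reference, the setting named in the report enables Spark's AES-based RPC encryption, which is implemented with Apache Commons Crypto on top of OpenSSL, the library whose FIPS self-test aborts in the logs above. A spark-defaults.conf fragment reconstructing the reporter's setup as described (shown for context, not as a recommendation; `spark.network.crypto.enabled` requires `spark.authenticate`):

```
# spark-defaults.conf (fragment)
spark.authenticate              true
# AES-based RPC encryption between Spark processes; backed by
# commons-crypto -> OpenSSL, which fails its FIPS self-test here
spark.network.crypto.enabled    true
```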
[jira] [Updated] (SPARK-30473) PySpark enum subclass crashes when used inside UDF
[ https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Härtwig updated SPARK-30473: Description: PySpark enum subclass crashes when used inside a UDF. Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586 in the function *_save_dynamic_enum*, the attribute *_member_names* is removed from the enum. Yet, this attribute is required by the *Enum* class. This results in all Enum subclasses crashing. was: PySpark enum subclass crashes when used inside a UDF. 
Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash. > PySpark enum subclass crashes when used inside UDF > -- > > Key: SPARK-30473 > URL: https://issues.apache.org/jira/browse/SPARK-30473 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 > Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, > Scala 2.11) >Reporter: Max Härtwig >Priority: Major > > PySpark enum subclass crashes when used inside a UDF. 
> > Example: > {code:java} > from enum import Enum > class Direction(Enum): > NORTH = 0 > SOUTH = 1 > {code} > > Working: > {code:java} > Direction.NORTH{code} > > Crashing: > {code:java} > @udf > def fn(a): > Direction.NORTH > return "" > df.withColumn("test", fn("a")){code} > > Stacktrace: > {noformat} > SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, > 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): > File "/databricks/spark/python/pyspark/serializers.py", line 182, in > _read_with_length return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 695, in > loads return pickle.loads(obj, encoding=encoding) > File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ > enum_members = {k: classdict[k] for k in classdict._member_names} > AttributeError: 'dict' object has no attribute '_member_names'{noformat} > > I suspect the problem is in *python/pyspark/cloudpickle.py*. On line 586 in > the function *_save_dynamic_enum*, the attribute *_member_names* is removed > from the enum. Yet, this attribute is required by the *Enum* class. This > results in all Enum subclasses crashing.
[jira] [Updated] (SPARK-30473) PySpark enum subclass crashes when used inside UDF
[ https://issues.apache.org/jira/browse/SPARK-30473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Härtwig updated SPARK-30473: Description: PySpark enum subclass crashes when used inside a UDF. Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash. was: PySpark enum subclass crashes when used inside a UDF. 
Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash. > PySpark enum subclass crashes when used inside UDF > -- > > Key: SPARK-30473 > URL: https://issues.apache.org/jira/browse/SPARK-30473 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 > Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, > Scala 2.11) >Reporter: Max Härtwig >Priority: Major > > PySpark enum subclass crashes when used inside a UDF. 
> > Example: > {code:java} > from enum import Enum > class Direction(Enum): > NORTH = 0 > SOUTH = 1 > {code} > > Working: > {code:java} > Direction.NORTH{code} > > Crashing: > {code:java} > @udf > def fn(a): > Direction.NORTH > return "" > df.withColumn("test", fn("a")){code} > > Stacktrace: > {noformat} > SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed > 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, > 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: > Traceback (most recent call last): > File "/databricks/spark/python/pyspark/serializers.py", line 182, in > _read_with_length return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 695, in > loads return pickle.loads(obj, encoding=encoding) > File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ > enum_members = {k: classdict[k] for k in classdict._member_names} > AttributeError: 'dict' object has no attribute '_member_names'{noformat} > > I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in > the function `_save_dynamic_enum`, the attribute `_member_names` is removed > from the enum. Yet, this attribute is required by the `Enum` class and Enum > subclasses will crash.
[jira] [Created] (SPARK-30473) PySpark enum subclass crashes when used inside UDF
Max Härtwig created SPARK-30473: --- Summary: PySpark enum subclass crashes when used inside UDF Key: SPARK-30473 URL: https://issues.apache.org/jira/browse/SPARK-30473 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Environment: Databricks Runtime 6.2 (includes Apache Spark 2.4.4, Scala 2.11) Reporter: Max Härtwig PySpark enum subclass crashes when used inside a UDF. Example: {code:java} from enum import Enum class Direction(Enum): NORTH = 0 SOUTH = 1 {code} Working: {code:java} Direction.NORTH{code} Crashing: {code:java} @udf def fn(a): Direction.NORTH return "" df.withColumn("test", fn("a")){code} Stacktrace: {noformat} SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 235, 10.139.64.21, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 182, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 695, in loads return pickle.loads(obj, encoding=encoding) File "/databricks/python/lib/python3.7/enum.py", line 152, in __new__ enum_members = {k: classdict[k] for k in classdict._member_names} AttributeError: 'dict' object has no attribute '_member_names'{noformat} I suspect the problem is in `python/pyspark/cloudpickle.py`. On line 586 in the function `_save_dynamic_enum`, the attribute `_member_names` is removed from the enum. Yet, this attribute is required by the `Enum` class and Enum subclasses will crash.
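The suspected mechanism can be reproduced without Spark or cloudpickle: Enum creation goes through `EnumMeta.__new__`, which reads `classdict._member_names`, an attribute that exists only on the special `_EnumDict` mapping the metaclass normally prepares. Rebuilding an Enum from a plain dict, roughly what a pickler would do after stripping `_member_names`, fails exactly as in the stacktrace. This is a standalone sketch; the cloudpickle line number and function name are as stated in the report, not verified here.

```python
from enum import Enum

# A plain dict standing in for the class dict a pickler reconstructs
# after dropping the _member_names bookkeeping attribute.
plain_classdict = {"NORTH": 0, "SOUTH": 1}

try:
    # type(Enum) is the Enum metaclass (EnumMeta / EnumType); calling it
    # with a plain dict skips __prepare__, so no _EnumDict is involved.
    type(Enum)("Direction", (Enum,), plain_classdict)
    error = None
except AttributeError as exc:
    error = exc

print(error)  # e.g. 'dict' object has no attribute '_member_names'
```

The same access appears at `enum.py` line 152 in the reporter's Python 3.7 traceback, which is consistent with the hypothesis that the pickled class dict lost `_member_names` before reconstruction.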
[jira] [Updated] (SPARK-30472) [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType.
[ https://issues.apache.org/jira/browse/SPARK-30472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30472: Summary: [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType. (was: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.) > [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting > String to IntegerType. > -- > > Key: SPARK-30472 > URL: https://issues.apache.org/jira/browse/SPARK-30472 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Minor
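The behavior this sub-task proposes can be sketched in a few lines: a strict string-to-integer cast that raises on malformed input (instead of silently producing null) and raises on values outside the 32-bit IntegerType range (instead of wrapping). This is a Python sketch of the semantics only; the function name is hypothetical, and Spark's actual implementation lives in its Cast expression.

```python
# 32-bit IntegerType bounds, as in Spark SQL
INT_MIN, INT_MAX = -(2 ** 31), 2 ** 31 - 1

def ansi_cast_string_to_int(s: str) -> int:
    """Strict (ANSI-style) cast of a string to a 32-bit integer:
    raise on invalid format, raise on overflow, never return null."""
    try:
        value = int(s.strip())  # ValueError if not a valid integer literal
    except ValueError:
        raise ValueError("invalid input syntax for type int: '%s'" % s)
    if not (INT_MIN <= value <= INT_MAX):
        raise OverflowError("casting '%s' to int causes overflow" % s)
    return value
```

Non-ANSI Spark would instead yield null for `'12abc'` and a wrapped value for out-of-range input; the point of the sub-task is to surface both cases as errors.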
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Description: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create a SparkContext because communication between the Spark Worker and the Spark Master is not possible if we configure *spark.network.crypto.enabled true*. *Error logs:* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between worker and master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found.
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context because communication between worker and master is not possible If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. 
Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Major > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > Spark Worker and Spark Master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Description: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create a SparkContext because communication between the worker and the master is not possible if we configure *spark.network.crypto.enabled true*. *Error logs:* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". 
was: On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark Workers are not able to create Spark Context If we configured *spark.network.crypto.enabled true*. *Error logs :* To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST FAILURE JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - please wait. JVMDUMP032I JVM requested System dump using '/bin/core.20200109.064150.283.0001.dmp' in response to an event JVMDUMP030W Cannot write dump to file/bin/core.20200109.064150.283.0001.dmp: Permission denied JVMDUMP012E Error in System dump: The core file created by child process with pid = 375 was not found. Expected to find core file with name "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" JVMDUMP030W Cannot write dump to file /bin/javacore.20200109.064150.283.0002.txt: Permission denied JVMDUMP032I JVM requested Java dump using '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt JVMDUMP032I JVM requested Snap dump using '/bin/Snap.20200109.064150.283.0003.trc' in response to an event JVMDUMP030W Cannot write dump to file /bin/Snap.20200109.064150.283.0003.trc: Permission denied JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc JVMDUMP030W Cannot write dump to file /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied JVMDUMP007I JVM Requesting JIT dump using '/tmp/jitdump.20200109.064150.283.0004.dmp' JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp JVMDUMP013I Processed dump event "abort", detail "". > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. 
> -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Major > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context because communication between > worker and master is not possible If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - > please wait. > JVMDUMP032I JVM
[jira] [Updated] (SPARK-30467) On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master.
[ https://issues.apache.org/jira/browse/SPARK-30467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHOBHIT SHUKLA updated SPARK-30467: --- Summary: On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Remote Master. (was: On Federal Information Processing Standard (FIPS) enabled cluster, Spark Workers are not able to connect to Master.) > On Federal Information Processing Standard (FIPS) enabled cluster, Spark > Workers are not able to connect to Remote Master. > -- > > Key: SPARK-30467 > URL: https://issues.apache.org/jira/browse/SPARK-30467 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.3, 2.3.4, 2.4.4 >Reporter: SHOBHIT SHUKLA >Priority: Major > Labels: security > > On _*Federal Information Processing Standard*_ (FIPS) enabled clusters, Spark > Workers are not able to create Spark Context If we configured > *spark.network.crypto.enabled true*. > *Error logs :* > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > fips.c(145): OpenSSL internal error, assertion failed: FATAL FIPS SELFTEST > FAILURE > JVMDUMP039I Processing dump event "abort", detail "" at 2020/01/09 06:41:50 - > please wait. > JVMDUMP032I JVM requested System dump using > '/bin/core.20200109.064150.283.0001.dmp' in response to an event > JVMDUMP030W Cannot write dump to > file/bin/core.20200109.064150.283.0001.dmp: Permission denied > JVMDUMP012E Error in System dump: The core file created by child process with > pid = 375 was not found. 
Expected to find core file with name > "/var/cores/core-netty-rpc-conne-sig11-user1000320999-group0-pid375-time*" > JVMDUMP030W Cannot write dump to file > /bin/javacore.20200109.064150.283.0002.txt: Permission denied > JVMDUMP032I JVM requested Java dump using > '/tmp/javacore.20200109.064150.283.0002.txt' in response to an event > JVMDUMP010I Java dump written to /tmp/javacore.20200109.064150.283.0002.txt > JVMDUMP032I JVM requested Snap dump using > '/bin/Snap.20200109.064150.283.0003.trc' in response to an event > JVMDUMP030W Cannot write dump to file > /bin/Snap.20200109.064150.283.0003.trc: Permission denied > JVMDUMP010I Snap dump written to /tmp/Snap.20200109.064150.283.0003.trc > JVMDUMP030W Cannot write dump to file > /bin/jitdump.20200109.064150.283.0004.dmp: Permission denied > JVMDUMP007I JVM Requesting JIT dump using > '/tmp/jitdump.20200109.064150.283.0004.dmp' > JVMDUMP010I JIT dump written to /tmp/jitdump.20200109.064150.283.0004.dmp > JVMDUMP013I Processed dump event "abort", detail "". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30452) Add predict and numFeatures in Python IsotonicRegressionModel
[ https://issues.apache.org/jira/browse/SPARK-30452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30452. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27122 [https://github.com/apache/spark/pull/27122] > Add predict and numFeatures in Python IsotonicRegressionModel > - > > Key: SPARK-30452 > URL: https://issues.apache.org/jira/browse/SPARK-30452 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > Fix For: 3.0.0 > > > Since IsotonicRegressionModel doesn't extend JavaPredictionModel, predict and > numFeatures need to be added explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
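For context, `predict` on an isotonic regression model is piecewise-linear interpolation between the fitted (boundary, prediction) pairs. A minimal sketch of that behavior (the function name and plain-list representation are illustrative, not the PySpark API being added here):

```python
from bisect import bisect_right

def isotonic_predict(boundaries, predictions, x):
    """Piecewise-linear interpolation over fitted (boundary, prediction) pairs.
    `boundaries` must be sorted ascending; values outside the fitted range
    are clamped to the first/last prediction."""
    if x <= boundaries[0]:
        return predictions[0]
    if x >= boundaries[-1]:
        return predictions[-1]
    # Find the segment [boundaries[i-1], boundaries[i]] containing x.
    i = bisect_right(boundaries, x)
    x0, x1 = boundaries[i - 1], boundaries[i]
    y0, y1 = predictions[i - 1], predictions[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

A call such as `isotonic_predict([1.0, 3.0, 5.0], [2.0, 4.0, 4.0], 2.0)` interpolates halfway along the first segment.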
[jira] [Assigned] (SPARK-30452) Add predict and numFeatures in Python IsotonicRegressionModel
[ https://issues.apache.org/jira/browse/SPARK-30452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30452: Assignee: Huaxin Gao > Add predict and numFeatures in Python IsotonicRegressionModel > - > > Key: SPARK-30452 > URL: https://issues.apache.org/jira/browse/SPARK-30452 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Trivial > > Since IsotonicRegressionModel doesn't extend JavaPredictionModel, predict and > numFeatures need to be added explicitly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011941#comment-17011941 ] Tobias Hermann commented on SPARK-30421: It allows you to write code that should break but does not. Then, later, somebody does an innocent and totally valid refactoring, and suddenly the code is broken, and this poor person goes crazy. ;) > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30472) ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.
feiwang created SPARK-30472: --- Summary: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow. Key: SPARK-30472 URL: https://issues.apache.org/jira/browse/SPARK-30472 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
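The ANSI behavior this sub-task asks for can be sketched in plain Python (the function name and error messages are hypothetical; Spark's actual implementation lives in its Cast expression): reject malformed input and 32-bit overflow with exceptions instead of silently producing null.

```python
def ansi_cast_to_int(s: str) -> int:
    """ANSI-style cast of a string to a 32-bit integer: raise on malformed
    input or overflow rather than returning null/truncating."""
    INT_MIN, INT_MAX = -2**31, 2**31 - 1
    t = s.strip()
    try:
        v = int(t)
    except ValueError:
        raise ValueError(f"invalid input syntax for type int: '{s}'") from None
    if not (INT_MIN <= v <= INT_MAX):
        raise OverflowError(f"casting '{s}' to INT causes overflow")
    return v
```

So `'2147483648'`, which is one past Int.MaxValue, raises instead of being coerced.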
[jira] [Commented] (SPARK-30421) Dropped columns still available for filtering
[ https://issues.apache.org/jira/browse/SPARK-30421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011930#comment-17011930 ] Wenchen Fan commented on SPARK-30421: - but it will not break anything, right? It just gives more chances to let your query compile. > Dropped columns still available for filtering > - > > Key: SPARK-30421 > URL: https://issues.apache.org/jira/browse/SPARK-30421 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Tobias Hermann >Priority: Minor > > The following minimal example: > {quote}val df = Seq((0, "a"), (1, "b")).toDF("foo", "bar") > df.select("foo").where($"bar" === "a").show > df.drop("bar").where($"bar" === "a").show > {quote} > should result in an error like the following: > {quote}org.apache.spark.sql.AnalysisException: cannot resolve '`bar`' given > input columns: [foo]; > {quote} > However, it does not but instead works without error, as if the column "bar" > would exist. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
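The check the reporter expects can be modeled with a toy resolver (a sketch with made-up names, not Spark's analyzer): a filter may only reference columns present in the plan's output, and referencing a dropped column should fail analysis.

```python
def resolve_filter(available_columns, filter_column):
    """Toy model of the expected analyzer check: reject filters over
    columns that are not in the plan's output."""
    if filter_column not in available_columns:
        raise ValueError(
            f"cannot resolve '`{filter_column}`' given input columns: "
            f"[{', '.join(available_columns)}]")
    return filter_column
```

Under this model, filtering on `bar` after `drop("bar")` (so only `foo` remains) raises the AnalysisException-style error quoted in the ticket.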
[jira] [Issue Comment Deleted] (SPARK-28317) Built-in Mathematical Functions: SCALE
[ https://issues.apache.org/jira/browse/SPARK-28317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oleg Bonar updated SPARK-28317: --- Comment: was deleted (was: HI [~shivuson...@gmail.com]! Have you made any progress on the issue?) > Built-in Mathematical Functions: SCALE > -- > > Key: SPARK-28317 > URL: https://issues.apache.org/jira/browse/SPARK-28317 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{scale(}}{{numeric}}{{)}}|{{integer}}|scale of the argument (the number of > decimal digits in the fractional part)|{{scale(8.41)}}|{{2}}| > https://www.postgresql.org/docs/11/functions-math.html#FUNCTIONS-MATH-FUNC-TABLE -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30428) File source V2: support partition pruning
[ https://issues.apache.org/jira/browse/SPARK-30428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30428. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27112 [https://github.com/apache/spark/pull/27112] > File source V2: support partition pruning > - > > Key: SPARK-30428 > URL: https://issues.apache.org/jira/browse/SPARK-30428 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When comparing a StringType and an IntegerType: '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType). Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} // Some comments here CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; // result is null {code} was: When comparing a StringType and an IntegerType: '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType). Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} // Some comments here CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code} > Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Major > > When comparing a StringType and an IntegerType: > '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType). > Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of the string may exceed Int.MaxValue, and then the result is > corrupted. 
> For example: > {code:java} > // Some comments here > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; // result is null > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
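The corruption can be reproduced outside Spark with a toy model of the coercion (function names are illustrative): narrowing the string side to IntegerType makes the overflowing value null, so the predicate silently matches nothing, while a wider common type keeps the comparison correct.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def cast_to_int32_or_null(s):
    """Non-ANSI style cast: values outside the 32-bit range become None,
    mimicking SQL null."""
    v = int(s)
    return v if INT32_MIN <= v <= INT32_MAX else None

def compare_as_int32(s, n):
    """Model of the current behavior: coerce the string side to IntegerType
    before comparing; null propagates through the comparison."""
    c = cast_to_int32_or_null(s)
    return None if c is None else c > n

def compare_widened(s, n):
    """Model of a safer common type: compare without narrowing to 32 bits."""
    return int(s) > n
```

Here `compare_as_int32('2147483648', 0)` yields null (the row is dropped, matching the "result is null" in the example), while `compare_widened` correctly returns true.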
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When we comparing a String Type and IntegerType: '2147483648'(StringType, which exceed Int.MaxValue) > 0(IntegerType). Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of string may exceed Int.MaxValue, then the result is corruputed. For example: {code:java} // Some comments here CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code} > Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Major > > When we comparing a String Type and IntegerType: > '2147483648'(StringType, which exceed Int.MaxValue) > 0(IntegerType). > Now the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of string may exceed Int.MaxValue, then the result is > corruputed. > For example: > {code:java} > // Some comments here > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011732#comment-17011732 ] Rafat commented on SPARK-10816: --- Same question as above: is there an SLA for this feature? Thanks > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
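For readers new to the ticket, gap-based event-time sessionization amounts to: sort events by timestamp and start a new session whenever the gap since the previous event exceeds the session timeout. A minimal batch sketch (illustrative only; the attached design docs propose doing this incrementally with state and watermarks in Structured Streaming):

```python
def sessionize(event_times, gap):
    """Group event timestamps into sessions: a new session starts when the
    gap since the previous event exceeds `gap`."""
    sessions = []
    current = []
    last = None
    for t in sorted(event_times):
        if last is not None and t - last > gap:
            sessions.append(current)  # close the previous session
            current = []
        current.append(t)
        last = t
    if current:
        sessions.append(current)
    return sessions
```

For example, with a gap of 5, events at times 1, 2, 10, 11 split into two sessions.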
[jira] [Created] (SPARK-30471) Fix issue when compare string and IntegerType
feiwang created SPARK-30471: --- Summary: Fix issue when compare string and IntegerType Key: SPARK-30471 URL: https://issues.apache.org/jira/browse/SPARK-30471 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30470) Uncache table in tempViews if needed on session closed
liupengcheng created SPARK-30470: Summary: Uncache table in tempViews if needed on session closed Key: SPARK-30470 URL: https://issues.apache.org/jira/browse/SPARK-30470 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.2 Reporter: liupengcheng Currently, Spark will not clean up cached tables in tempViews produced by sql like the following `CACHE TABLE table1 as SELECT ` There is a risk that `uncache table` is never called because the session closed unexpectedly, or the user closed it manually. These temp views are then lost, and we cannot visit them in another session, but the cached plan still exists in the `CacheManager`. Moreover, the leaks may cause the failure of subsequent queries; one failure we encountered in our production environment is as below: {code:java} Caused by: java.io.FileNotFoundException: File does not exist: /user//xx/data__db60e76d_91b8_42f3_909d_5c68692ecdd4 It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. 
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:131) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.scan_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage0.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) {code} The above exception happens when user update the data of the table, but spark still use the old cached plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
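The proposed cleanup can be sketched as a per-session registry of cached temp views that is drained when the session closes (class and method names here are hypothetical, not Spark's internals):

```python
class CacheManager:
    """Stand-in for the shared cache of query plans."""
    def __init__(self):
        self.cached = set()

    def cache(self, name):
        self.cached.add(name)

    def uncache(self, name):
        self.cached.discard(name)


class Session:
    """Sketch of the proposal: track which temp views this session cached
    and uncache them when the session closes."""
    def __init__(self, cache_manager):
        self._cm = cache_manager
        self._session_cached_views = set()

    def cache_temp_view(self, name):
        self._cm.cache(name)
        self._session_cached_views.add(name)

    def close(self):
        # Without this step, the cached plans leak after the session is gone.
        for name in self._session_cached_views:
            self._cm.uncache(name)
        self._session_cached_views.clear()
```

Whether session close happens cleanly or via a shutdown hook, the point is that the shared `CacheManager` no longer retains entries for temp views that nobody can reference.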
[jira] [Updated] (SPARK-30469) Partition columns should not be involved when calculating sizeInBytes of Project logical plan
[ https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30469: -- Description: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because partition columns actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. was: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because hive partition column actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. > Partition columns should not be involved when calculating sizeInBytes of > Project logical plan > - > > Key: SPARK-30469 > URL: https://issues.apache.org/jira/browse/SPARK-30469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > When getting the statistics of a Project logical plan, if CBO not enabled, > Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate > the size in bytes, which will compute the ratio of the row size of the > project plan and its child plan. > And the row size is computed based on the output attributes (columns). 
> Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition > columns of hive table as well, which is not reasonable, because partition > columns actually does not account for sizeInBytes. > This may make the sizeInBytes not accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30469) Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan
[ https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30469: -- Description: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because hive partition column actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. was: When getting the statistics of a Project logical plan, if CBO not enabled, Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which will compute the ratio of the row size of the project plan and its child plan. And the row size is computed based on the out attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition columns of hive table as well, which is not reasonable, because hive partition column actually does not account for sizeInBytes. This may make the sizeInBytes not accurate. > Hive Partition columns should not be involved when calculating sizeInBytes of > Project logical plan > -- > > Key: SPARK-30469 > URL: https://issues.apache.org/jira/browse/SPARK-30469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > When getting the statistics of a Project logical plan, if CBO not enabled, > Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate > the size in bytes, which will compute the ratio of the row size of the > project plan and its child plan. > And the row size is computed based on the output attributes (columns). 
> Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition > columns of hive table as well, which is not reasonable, because hive > partition column actually does not account for sizeInBytes. > This may make the sizeInBytes not accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30469) Partition columns should not be involved when calculating sizeInBytes of Project logical plan
[ https://issues.apache.org/jira/browse/SPARK-30469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hu Fuwang updated SPARK-30469: -- Summary: Partition columns should not be involved when calculating sizeInBytes of Project logical plan (was: Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan) > Partition columns should not be involved when calculating sizeInBytes of > Project logical plan > - > > Key: SPARK-30469 > URL: https://issues.apache.org/jira/browse/SPARK-30469 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hu Fuwang >Priority: Major > > When getting the statistics of a Project logical plan, if CBO not enabled, > Spark will call SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate > the size in bytes, which will compute the ratio of the row size of the > project plan and its child plan. > And the row size is computed based on the output attributes (columns). > Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involve partition > columns of hive table as well, which is not reasonable, because hive > partition column actually does not account for sizeInBytes. > This may make the sizeInBytes not accurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30469) Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan
Hu Fuwang created SPARK-30469: - Summary: Hive Partition columns should not be involved when calculating sizeInBytes of Project logical plan Key: SPARK-30469 URL: https://issues.apache.org/jira/browse/SPARK-30469 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Hu Fuwang When getting the statistics of a Project logical plan, if CBO is not enabled, Spark calls SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode to calculate the size in bytes, which computes the ratio of the row size of the Project plan and its child plan. And the row size is computed based on the output attributes (columns). Currently, SizeInBytesOnlyStatsPlanVisitor.visitUnaryNode involves partition columns of Hive tables as well, which is not reasonable because Hive partition columns do not actually account for sizeInBytes. This may make the sizeInBytes inaccurate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30468: - Description: Currently data columns are displayed in one line for show create table command, when the table has many columns (to make things even worse, columns may have long names or comments), the displayed result is really hard to read. E.g. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} was: Currently data columns are displayed in one line for show create table command, when the table has many columns, and columns may have long names or comments, the displayed result is really hard to read. E.g. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. 
{noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} > Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns (to make things even worse, columns > may have long names or comments), the displayed result is really hard to > read. E.g. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > {noformat} > To improve readability, we should print each column in a separate line. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
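The proposed layout is easy to sketch: emit one column definition per line, comma-separated (a toy formatter with hypothetical names, not Spark's actual SHOW CREATE TABLE implementation):

```python
def format_create_table(table, columns, using):
    """Render a CREATE TABLE statement with one column definition per line.
    `columns` is a list of (name, data_type, comment) tuples; comment may be
    None when the column has no comment."""
    lines = []
    for name, dtype, comment in columns:
        col = f"  `{name}` {dtype}"
        if comment:
            col += f" COMMENT '{comment}'"
        lines.append(col)
    return (f"CREATE TABLE `{table}` (\n"
            + ",\n".join(lines)
            + f")\nUSING {using}")
```

Each column lands on its own indented line, matching the improved output shown in the ticket.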
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-30468: - Description: Currently data columns are displayed in one line for show create table command, when the table has many columns, and columns may have long names or comments, the displayed result is really hard to read. E.g. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} To improve readability, we should print each column in a separate line. {noformat} spark-sql> show create table test_table; CREATE TABLE `test_table` ( `col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {noformat} was: Currently data columns are displayed in one line for show create table command, when the table has many columns, and columns may have long names or comments, the displayed result is really hard to read. E.g. {{{noformat}}} spark-sql> show create table test_table; CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3') USING parquet {{{noformat}}} To improve readability, we should print each column in a separate line. 
> Use multiple lines to display data columns for show create table command > > > Key: SPARK-30468 > URL: https://issues.apache.org/jira/browse/SPARK-30468 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Currently data columns are displayed in one line for show create table > command, when the table has many columns, and columns may have long names or > comments, the displayed result is really hard to read. E.g. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT > 'This is comment for column 3') > USING parquet > {noformat} > To improve readability, we should print each column in a separate line. > {noformat} > spark-sql> show create table test_table; > CREATE TABLE `test_table` ( > `col1` INT COMMENT 'This is comment for column 1', > `col2` STRING COMMENT 'This is comment for column 2', > `col3` DOUBLE COMMENT 'This is comment for column 3') > USING parquet > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-30468:
---------------------------------
    Description: 
Currently data columns are displayed in one line for the show create table command. When the table has many columns, and the columns have long names or comments, the displayed result is really hard to read. E.g.

{{{noformat}}}
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
{{{noformat}}}

To improve readability, we should print each column on a separate line.

  was:
Currently data columns are displayed in one line for the show create table command. When the table has many columns, and the columns have long names or comments, the displayed result is really hard to read. E.g.

```
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
```

To improve readability, we should print each column on a separate line.

> Use multiple lines to display data columns for show create table command
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30468
>                 URL: https://issues.apache.org/jira/browse/SPARK-30468
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zhenhua Wang
>            Priority: Minor
>
> Currently data columns are displayed in one line for the show create table
> command. When the table has many columns, and the columns have long names or
> comments, the displayed result is really hard to read. E.g.
> {{{noformat}}}
> spark-sql> show create table test_table;
> CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1',
> `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT
> 'This is comment for column 3')
> USING parquet
> {{{noformat}}}
> To improve readability, we should print each column on a separate line.
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-30468:
---------------------------------
    Description: 
Currently data columns are displayed in one line for the show create table command. When the table has many columns, and the columns have long names or comments, the displayed result is really hard to read. E.g.

```
spark-sql> show create table test_table;
CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1', `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT 'This is comment for column 3')
USING parquet
```

To improve readability, we should print each column on a separate line.

  was:
Currently data columns are displayed in one line for show create table command, when the table has many columns, and even worse, colu

> Use multiple lines to display data columns for show create table command
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30468
>                 URL: https://issues.apache.org/jira/browse/SPARK-30468
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zhenhua Wang
>            Priority: Minor
>
> Currently data columns are displayed in one line for the show create table
> command. When the table has many columns, and the columns have long names or
> comments, the displayed result is really hard to read. E.g.
> ```
> spark-sql> show create table test_table;
> CREATE TABLE `test_table` (`col1` INT COMMENT 'This is comment for column 1',
> `col2` STRING COMMENT 'This is comment for column 2', `col3` DOUBLE COMMENT
> 'This is comment for column 3')
> USING parquet
> ```
> To improve readability, we should print each column on a separate line.
[jira] [Updated] (SPARK-30468) Use multiple lines to display data columns for show create table command
[ https://issues.apache.org/jira/browse/SPARK-30468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhenhua Wang updated SPARK-30468:
---------------------------------
    Description: 
Currently data columns are displayed in one line for show create table command, when the table has many columns, and even worse, colu

> Use multiple lines to display data columns for show create table command
> -------------------------------------------------------------------------
>
>                 Key: SPARK-30468
>                 URL: https://issues.apache.org/jira/browse/SPARK-30468
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Zhenhua Wang
>            Priority: Minor
>
> Currently data columns are displayed in one line for show create table
> command, when the table has many columns, and even worse, colu
[jira] [Created] (SPARK-30468) Use multiple lines to display data columns for show create table command
Zhenhua Wang created SPARK-30468:
------------------------------------

             Summary: Use multiple lines to display data columns for show create table command
                 Key: SPARK-30468
                 URL: https://issues.apache.org/jira/browse/SPARK-30468
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Zhenhua Wang
[jira] [Commented] (SPARK-28883) Fix a flaky test: ThriftServerQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-28883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011533#comment-17011533 ]

Jungtaek Lim commented on SPARK-28883:
--------------------------------------
Would SPARK-30345 be a complement to this? Or does this issue cover more cases?

> Fix a flaky test: ThriftServerQueryTestSuite
> --------------------------------------------
>
>                 Key: SPARK-28883
>                 URL: https://issues.apache.org/jira/browse/SPARK-28883
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 3.0.0
>            Reporter: Yuming Wang
>            Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109764/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
> (2 failures)
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109768/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/
> (4 failures)
> Error message:
> {noformat}
> java.sql.SQLException: Could not open client transport with JDBC Uri:
> jdbc:hive2://localhost:14431: java.net.ConnectException: Connection refused
> (Connection refused)
> {noformat}
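The `Connection refused` error above is a classic race: the test tries to open a JDBC connection before the Thrift server socket is listening. One common mitigation is to retry the connection with a short backoff. The sketch below is a generic retry helper under stated assumptions: the helper name, the `Supplier` signature, and the simulated flaky operation are all illustrative, and this is not the fix the Spark test suite actually adopted.

```java
import java.util.function.Supplier;

public class RetryConnect {
    // Runs op up to `attempts` times, sleeping `sleepMillis` between failures.
    // Rethrows the last failure if every attempt fails.
    static <T> T withRetries(int attempts, long sleepMillis, Supplier<T> op) {
        RuntimeException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
        if (last != null) throw last;
        throw new IllegalArgumentException("attempts must be > 0");
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky server: refuses the first two connection attempts,
        // then "accepts" (stands in for DriverManager.getConnection on the
        // hive2 JDBC URL from the error message).
        String result = withRetries(5, 10, () -> {
            if (++calls[0] < 3) throw new RuntimeException("Connection refused");
            return "connected";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: connected after 3 attempts
    }
}
```

In a real test fixture the retried operation would be the JDBC connect itself, and the retry budget should comfortably exceed the server's worst-case startup time on a loaded CI machine.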