[jira] [Assigned] (SPARK-27493) Upgrade ASM to 7.1

2019-04-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27493:


Assignee: Dongjoon Hyun

> Upgrade ASM to 7.1
> --
>
> Key: SPARK-27493
> URL: https://issues.apache.org/jira/browse/SPARK-27493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> SPARK-25946 upgraded ASM to 7.0 to support JDK11. This issue aims to update ASM
> to 7.1 to bring in its bug fixes.
> - https://asm.ow2.io/versions.html






[jira] [Resolved] (SPARK-27493) Upgrade ASM to 7.1

2019-04-17 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27493.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24395
[https://github.com/apache/spark/pull/24395]

> Upgrade ASM to 7.1
> --
>
> Key: SPARK-27493
> URL: https://issues.apache.org/jira/browse/SPARK-27493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-25946 upgraded ASM to 7.0 to support JDK11. This issue aims to update ASM
> to 7.1 to bring in its bug fixes.
> - https://asm.ow2.io/versions.html






[jira] [Updated] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27498:
--
Description: 
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the Hive table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.
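
For reference, a minimal spark-shell sketch of that workaround (assuming the tables from the Hive session above exist and that the conf can be toggled at runtime; it can also be passed via --conf when launching the shell):
{noformat}
scala> spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

scala> sql("insert into hivebucketparq1 select 1, 2, 3")
// with conversion disabled, the insert goes through InsertIntoHiveTable and
// should fail with the same AnalysisException as the text table above
{noformat}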

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not. Probably the check should be made 
in an analyzer rule while the InsertIntoTable node still holds a 
HiveTableRelation.

 

 

  was:
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the Hive table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.

 

 


> Built-in parquet code path does not respect hive.enforce.bucketing
> --
>
> Key: SPARK-27498
> URL: https://issues.apache.org/jira/browse/SPARK-27498
> Project: Spark
>  Issue Type: Bug

[jira] [Updated] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27498:
--
Description: 
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.

 

 

  was:
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting Hive-bucketed tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.

 

 


> Built-in parquet code path does not respect hive.enforce.bucketing
> --
>
> Key: SPARK-27498
> URL: https://issues.apache.org/jira/browse/SPARK-27498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>

[jira] [Updated] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27498:
--
Description: 
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the Hive table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.

 

 

  was:
_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting bucketed Hive tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.

 

 


> Built-in parquet code path does not respect hive.enforce.bucketing
> --
>
> Key: SPARK-27498
> URL: https://issues.apache.org/jira/browse/SPARK-27498
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Bruce Robbins
>Priority: 

[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27497:
--
Description: 
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira (SPARK-27498).

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'SORTBUCKETCOLSPREFIX'='TRUE', 
  'numFiles'='2', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='1144', 
  'transient_lastDdlTime'='123374')
Time taken: 1.619 seconds, Fetched: 20 row(s)
hive> 
{noformat}
This information is lost when Spark attempts to update table stats. 
HiveClientImpl.toHiveTable drops the bucket specification. toHiveTable drops 
the bucket information because {{table.provider}} is None instead of "hive". 
{{table.provider}} is not "hive" because Spark bypassed the serdes and used the 
built-in parquet code path (by default, spark.sql.hive.convertMetastoreParquet 
is true).
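
A quick way to observe the loss from the Spark side (a sketch; assumes DESCRIBE FORMATTED on this Spark version prints the "Num Buckets"/"Bucket Columns" rows for bucketed tables):
{noformat}
scala> sql("describe formatted hivebucket1").where("col_name like '%Bucket%'").show(false)
// before the insert this should list 10 buckets on [a, b];
// after the insert the bucketing rows are gone, matching the HMS output above
{noformat}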

  was:
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is 

[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27497:
--
Description: 
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira (SPARK-27498).

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'SORTBUCKETCOLSPREFIX'='TRUE', 
  'numFiles'='2', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='1144', 
  'transient_lastDdlTime'='123374')
Time taken: 1.619 seconds, Fetched: 20 row(s)
hive> 
{noformat}
This information is lost when Spark attempts to update table stats. This is 
because HiveClientImpl.toHiveTable drops the bucket specification. toHiveTable 
drops the bucket information because {{table.provider}} is None. 
{{table.provider}} is None because (I assume) Spark bypassed the serdes and 
used the built-in parquet code path (by default, 
spark.sql.hive.convertMetastoreParquet is true).


  was:
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is 

[jira] [Commented] (SPARK-25422) flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated (encryption = on) (with replication as stream)

2019-04-17 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820707#comment-16820707
 ] 

Mike Chan commented on SPARK-25422:
---

Could this problem potentially hit Spark 2.3.1 as well? I have a new cluster at this version and always hit a corrupt remote block error when one specific table is involved.

> flaky test: org.apache.spark.DistributedSuite.caching on disk, replicated 
> (encryption = on) (with replication as stream)
> 
>
> Key: SPARK-25422
> URL: https://issues.apache.org/jira/browse/SPARK-25422
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Imran Rashid
>Priority: Major
> Fix For: 2.4.0
>
>
> stacktrace
> {code}
>  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 7, localhost, executor 1): java.io.IOException: 
> org.apache.spark.SparkException: corrupt remote block broadcast_0_piece0 of 
> broadcast_0: 1651574976 != 1165629262
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1320)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>   at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:84)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$7.apply(Executor.scala:367)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1347)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:373)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: corrupt remote block 
> broadcast_0_piece0 of broadcast_0: 1651574976 != 1165629262
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:167)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:151)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:151)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:231)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1313)
>   ... 13 more
> {code}






[jira] [Updated] (SPARK-27441) Add read/write tests to Hive serde tables (include Parquet vectorized reader)

2019-04-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27441:

Summary: Add read/write tests to Hive serde tables (include Parquet vectorized reader)  (was: Add read/write tests to Hive serde tables)

> Add read/write tests to Hive serde tables (include Parquet vectorized reader)
> 
>
> Key: SPARK-27441
> URL: https://issues.apache.org/jira/browse/SPARK-27441
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The ORC and Parquet versions used by Spark and Hive, before and after upgrading
> the built-in Hive to 2.3.4:
> When the built-in Hive is 1.2.1:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|Hive built-in|1.6.0|
> |Hive 1.2.1|Hive built-in|1.6.0|
> When the built-in Hive is 2.3.4:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|1.5.5|1.8.1|
> |Hive 2.3.4|1.3.3|1.8.1|
> We should add read/write tests for Hive serde tables.
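
A rough sketch of the kind of round trip such a test could cover (table name is illustrative; run in a Hive-enabled spark-shell, where reads of a Parquet-backed Hive serde table go through the vectorized reader when spark.sql.hive.convertMetastoreParquet is true, the default):
{noformat}
scala> sql("create table hive_serde_parquet_t (id int, name string) stored as parquet")
scala> sql("insert into hive_serde_parquet_t values (1, 'a'), (2, 'b')")
scala> sql("select * from hive_serde_parquet_t order by id").show()
{noformat}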






[jira] [Updated] (SPARK-27441) Add read/write tests to Hive serde tables

2019-04-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27441:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-27500

> Add read/write tests to Hive serde tables
> -
>
> Key: SPARK-27441
> URL: https://issues.apache.org/jira/browse/SPARK-27441
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The ORC and Parquet versions used by Spark and Hive, before and after upgrading
> the built-in Hive to 2.3.4:
> When the built-in Hive is 1.2.1:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|Hive built-in|1.6.0|
> |Hive 1.2.1|Hive built-in|1.6.0|
> When the built-in Hive is 2.3.4:
> || ||ORC||Parquet||
> |Spark datasource table|1.5.5|1.10.1|
> |Spark hive table|1.5.5|1.8.1|
> |Hive 2.3.4|1.3.3|1.8.1|
> We should add read/write tests for Hive serde tables.






[jira] [Created] (SPARK-27501) Add test for HIVE-13083: Writing HiveDecimal to ORC can wrongly suppress present stream

2019-04-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27501:
---

 Summary: Add test for HIVE-13083: Writing HiveDecimal to ORC can 
wrongly suppress present stream
 Key: SPARK-27501
 URL: https://issues.apache.org/jira/browse/SPARK-27501
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Created] (SPARK-27500) Add tests for the built-in Hive 2.3

2019-04-17 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27500:
---

 Summary: Add tests for the built-in Hive 2.3
 Key: SPARK-27500
 URL: https://issues.apache.org/jira/browse/SPARK-27500
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


Spark will use some of the new features and bug fixes in Hive 2.3, and we should add tests for them. This is an umbrella JIRA for tracking that work.






[jira] [Created] (SPARK-27499) Support mapping spark.local.dir to hostPath volume

2019-04-17 Thread Junjie Chen (JIRA)
Junjie Chen created SPARK-27499:
---

 Summary: Support mapping spark.local.dir to hostPath volume
 Key: SPARK-27499
 URL: https://issues.apache.org/jira/browse/SPARK-27499
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.1
Reporter: Junjie Chen


Currently, the k8s executor builder mounts spark.local.dir as an emptyDir or in-memory (tmpfs) volume. That is fine for small workloads, but for heavy workloads such as TPC-DS both options can cause problems: pods get evicted due to disk pressure when using emptyDir, and executors hit OOM when using tmpfs.

In particular, in cloud environments users may allocate a cluster with a minimal configuration and attach cloud storage when running a workload. In this case, it would help to be able to specify multiple elastic storage volumes as spark.local.dir to accelerate spilling.
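
A sketch of what this would look like with the existing hostPath volume options, pointing spark.local.dir at the mount (volume name and paths are illustrative, and this assumes the executor builder would honor the setting instead of mounting its own emptyDir, which is the gap this issue describes):
{noformat}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-dir-on-hostpath")
  // hostPath volume mounted into the executor pod (illustrative names/paths)
  .config("spark.kubernetes.executor.volumes.hostPath.spark-local.mount.path", "/data/spark-local")
  .config("spark.kubernetes.executor.volumes.hostPath.spark-local.options.path", "/mnt/fast-disk")
  // point local/scratch space at the mounted volume
  .config("spark.local.dir", "/data/spark-local")
  .getOrCreate()
{noformat}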






[jira] [Commented] (SPARK-17668) Support representing structs with case classes and tuples in spark sql udf inputs

2019-04-17 Thread william hesch (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820637#comment-16820637
 ] 

william hesch commented on SPARK-17668:
---

+1

> Support representing structs with case classes and tuples in spark sql udf 
> inputs
> -
>
> Key: SPARK-17668
> URL: https://issues.apache.org/jira/browse/SPARK-17668
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: koert kuipers
>Priority: Minor
>
> After having gotten used to having case classes represent complex structures in
> Datasets, I am surprised to find out that when I work in DataFrames with UDFs
> no such magic exists, and I have to fall back to manipulating Row objects,
> which is error prone and somewhat ugly.
> For example:
> {noformat}
> case class Person(name: String, age: Int)
> val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", 
> "id")
> val df1 = df.withColumn("person", udf({ (p: Person) => p.copy(age = p.age + 
> 1) }).apply(col("person")))
> df1.printSchema
> df1.show
> {noformat}
> leads to:
> {noformat}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast 
> to Person
> {noformat}
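
One workaround until such support exists is to go through a typed Dataset instead of a UDF, so the struct is decoded back into the case class (a sketch; the wrapper case class Entry is illustrative):
{noformat}
case class Person(name: String, age: Int)
case class Entry(person: Person, id: Int)

import spark.implicits._
val df = Seq((Person("john", 33), 5), (Person("mike", 30), 6)).toDF("person", "id")

// map over Dataset[Entry]; the person struct is decoded into Person and re-encoded
val df1 = df.as[Entry].map(e => e.copy(person = e.person.copy(age = e.person.age + 1))).toDF()
df1.printSchema
df1.show
{noformat}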






[jira] [Commented] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x

2019-04-17 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820619#comment-16820619
 ] 

Stavros Kontopoulos commented on SPARK-27491:
-

Can you reach the rest api outside spark submit?
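
For instance, the status endpoint can be queried directly (a sketch; assumes the standalone master's REST port 6066 is reachable from the client):
{noformat}
scala> import scala.io.Source
scala> val url = "http://domainhere:6066/v1/submissions/status/driver-20190417130324-0009"
scala> println(Source.fromURL(url).mkString)
// should print the SubmissionStatusResponse JSON shown in the description below
{noformat}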

> SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty 
> response! therefore Airflow won't integrate with Spark 2.3.x
> --
>
> Key: SPARK-27491
> URL: https://issues.apache.org/jira/browse/SPARK-27491
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Scheduler, Spark Core, Spark Shell, Spark 
> Submit
>Affects Versions: 2.3.3
>Reporter: t oo
>Priority: Blocker
>
> This issue must have been introduced after Spark 2.1.1, as it works in that
> version. It affects me in Spark 2.3.3/2.3.0. I am using Spark standalone mode,
> if that makes a difference.
> See below: Spark 2.3.3 returns an empty response while 2.1.1 returns a response.
>  
> Spark 2.1.1:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + export SPARK_HOME=/home/ec2here/spark_home1
> + SPARK_HOME=/home/ec2here/spark_home1
> + '[' -z /home/ec2here/spark_home1 ']'
> + . /home/ec2here/spark_home1/bin/load-spark-env.sh
> ++ '[' -z /home/ec2here/spark_home1 ']'
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++ parent_dir=/home/ec2here/spark_home1
> ++ user_conf_dir=/home/ec2here/spark_home1/conf
> ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']'
> ++ set -a
> ++ . /home/ec2here/spark_home1/conf/spark-env.sh
> +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
> +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
>  ulimit -n 1048576
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10
> ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]]
> ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']'
> + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
> + '[' -d /home/ec2here/spark_home1/jars ']'
> + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars
> + '[' '!' -d /home/ec2here/spark_home1/jars ']'
> + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*'
> + '[' -n '' ']'
> + [[ -n '' ]]
> + CMD=()
> + IFS=
> + read -d '' -r ARG
> ++ build_command org.apache.spark.deploy.SparkSubmit --master 
> spark://domainhere:6066 --status driver-20190417130324-0009
> ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp 
> '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> ++ printf '%d\0' 0
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + COUNT=10
> + LAST=9
> + LAUNCHER_EXIT_CODE=0
> + [[ 0 =~ ^[0-9]+$ ]]
> + '[' 0 '!=' 0 ']'
> + CMD=("${CMD[@]:0:$LAST}")
> + exec /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp 
> '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20190417130324-0009 in spark://domainhere:6066.
> 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with 
> SubmissionStatusResponse:
> {
>  "action" : "SubmissionStatusResponse",
>  "driverState" : "FAILED",
>  "serverSparkVersion" : "2.3.3",
>  "submissionId" : "driver-20190417130324-0009",
>  "success" : true,
>  "workerHostPort" : "x.y.211.40:11819",
>  "workerId" : "worker-20190417115840-x.y.211.40-11819"
> }
> [ec2here@ip-x-y-160-225 ~]$
>  
> Spark 2.3.3:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + '[' -z '' ']'
> ++ dirname 

[jira] [Created] (SPARK-27498) Built-in parquet code path does not respect hive.enforce.bucketing

2019-04-17 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-27498:
-

 Summary: Built-in parquet code path does not respect 
hive.enforce.bucketing
 Key: SPARK-27498
 URL: https://issues.apache.org/jira/browse/SPARK-27498
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: Bruce Robbins


_Caveat: I can see how this could be intentional if Spark believes that the 
built-in Parquet code path is creating Hive-compatible bucketed files. However, 
I assume that is not the case and that this is an actual bug._
  
 Spark makes an effort to avoid corrupting Hive-bucketed tables unless the user 
overrides this behavior by setting hive.enforce.bucketing and 
hive.enforce.sorting to false.

However, this behavior falls down when Spark uses the built-in Parquet code 
path to write to the table.

Here's an example.

In Hive, do this (I create a table where things work as expected, and one where 
things don't work as expected):
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebuckettext1;
hive> create table hivebuckettext1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as textfile;
hive> insert into hivebuckettext1 select * from sourcetable;
hive> drop table hivebucketparq1;
hive> create table hivebucketparq1 (a int, b int, c int) clustered by (a, b) 
sorted by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucketparq1 select * from sourcetable;
{noformat}
For the text table, things seem to work as expected:
{noformat}
scala> sql("insert into hivebuckettext1 select 1, 2, 3")
19/04/17 10:26:08 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
org.apache.spark.sql.AnalysisException: Output Hive table 
`default`.`hivebuckettext1` is bucketed but Spark currently does NOT populate 
bucketed output which is compatible with Hive.;
{noformat}
For the parquet table, the insert just happens:
{noformat}
scala> sql("insert into hivebucketparq1 select 1, 2, 3")
res1: org.apache.spark.sql.DataFrame = []
scala> 
{noformat}
Note also that Spark has changed the table definition of hivebucketparq1 (in 
the HMS!) so that it is no longer a bucketed table. I will file a separate Jira 
on this (SPARK-27497).

If you specify "spark.sql.hive.convertMetastoreParquet=false", things work as 
expected.

Basically, InsertIntoHiveTable respects hive.enforce.bucketing, but 
InsertIntoHadoopFsRelationCommand does not.

 

 






[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27497:
--
Description: 
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira (SPARK-27498).

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'SORTBUCKETCOLSPREFIX'='TRUE', 
  'numFiles'='2', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='1144', 
  'transient_lastDdlTime'='123374')
Time taken: 1.619 seconds, Fetched: 20 row(s)
hive> 
{noformat}
This information is lost when Spark attempts to update table stats. This is 
because HiveClientImpl.toHiveTable drops the bucket specification.

  was:
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira.

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE 

[jira] [Updated] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats

2019-04-17 Thread Bruce Robbins (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-27497:
--
Description: 
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira.

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'SORTBUCKETCOLSPREFIX'='TRUE', 
  'numFiles'='2', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='1144', 
  'transient_lastDdlTime'='123374')
Time taken: 1.619 seconds, Fetched: 20 row(s)
hive> 
{noformat}
This information is lost when Spark attempts to update table stats. This is 
because HiveClientImpl.toHiveTable drops the bucket specification.

  was:
The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira.

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 

[jira] [Created] (SPARK-27497) Spark wipes out bucket spec in metastore when updating table stats

2019-04-17 Thread Bruce Robbins (JIRA)
Bruce Robbins created SPARK-27497:
-

 Summary: Spark wipes out bucket spec in metastore when updating 
table stats
 Key: SPARK-27497
 URL: https://issues.apache.org/jira/browse/SPARK-27497
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: Bruce Robbins


The bucket spec gets wiped out after Spark writes to a Hive-bucketed table that 
has the following characteristics:
 - table is created by Hive (or even Spark, if you use HQL DDL)
 - table is stored in Parquet format
 - table has at least one Hive-created data file already

For example, do the following in Hive:
{noformat}
hive> create table sourcetable as select 1 a, 3 b, 7 c;
hive> drop table hivebucket1;
hive> create table hivebucket1 (a int, b int, c int) clustered by (a, b) sorted 
by (a, b asc) into 10 buckets stored as parquet;
hive> insert into hivebucket1 select * from sourcetable;
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
CLUSTERED BY ( 
  a, 
  b) 
SORTED BY ( 
  a ASC, 
  b ASC) 
INTO 10 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/Users/brobbins/github/spark_upstream/spark-warehouse/hivebucket1'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true', 
  'numFiles'='1', 
  'numRows'='1', 
  'rawDataSize'='3', 
  'totalSize'='352', 
  'transient_lastDdlTime'='142971')
Time taken: 0.056 seconds, Fetched: 26 row(s)
hive> 
{noformat}
Then in spark-shell, do the following:
{noformat}
scala> sql("insert into hivebucket1 select 1, 3, 7")
19/04/17 10:49:30 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = []
{noformat}
Note: At this point, I would have expected Spark to throw an 
{{AnalysisException}} with the message "Output Hive table 
`default`.`hivebucket1` is bucketed...". However, I am ignoring that for now 
and may open a separate Jira.

Return to some Hive CLI and note that the bucket specification is gone from the 
table definition:
{noformat}
hive> show create table hivebucket1;
OK
CREATE TABLE `hivebucket1`(
  `a` int, 
  `b` int, 
  `c` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false', 
  'SORTBUCKETCOLSPREFIX'='TRUE', 
  'numFiles'='2', 
  'numRows'='-1', 
  'rawDataSize'='-1', 
  'totalSize'='1144', 
  'transient_lastDdlTime'='123374')
Time taken: 1.619 seconds, Fetched: 20 row(s)
hive> 
{noformat}
This information is lost when Spark attempts to update table stats. This is 
because HiveClientImpl.toHiveTable drops the bucket specification.
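
A quick way to watch this from the Spark side (just a verification sketch, using 
the example table above; the "Num Buckets" / "Bucket Columns" rows are what 
DESCRIBE prints for bucketed tables):
{code}
// Sketch: check whether Spark still sees bucketing metadata for the example table
// (run in spark-shell). Before the insert the output should include the
// "Num Buckets" and "Bucket Columns" rows; after the insert they disappear once
// the spec has been wiped from the metastore.
spark.sql("DESCRIBE FORMATTED hivebucket1")
  .filter("col_name LIKE '%Bucket%'")
  .show(truncate = false)
{code}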



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features

2019-04-17 Thread Calvin Park (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820549#comment-16820549
 ] 

Calvin Park edited comment on SPARK-27365 at 4/17/19 10:34 PM:
---

Simple Jenkinsfile
{code:java}
pipeline {
agent {
dockerfile {
label 'docker-gpu'
args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all'
}
}
stages {
stage('smi') {
steps {
sh 'nvidia-smi'
}
}
}
}{code}
with Dockerfile
{code:java}
FROM ubuntu:xenial

RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
curl \
flake8 \
git-core \
openjdk-8-jdk \
python2.7 \
python-pip \
wget

RUN pip install \
requests \
numpy

# Build script looks for javac in jre dir
ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64"

# 
http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
# We have a pretty beefy server
ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g"
{code}


was (Author: calvinatnvidia):
Simple Jenkinsfile
{code:java}
pipeline {
agent {
dockerfile {
label 'docker-gpu'
args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all'
}
}
stages {
stage('smi') {
steps {
sh 'nvidia-smi'
}
}
}
}{code}
with Dockerfile
{code:java}
FROM ubuntu:xenial

RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
curl \
flake8 \
git-core \
openjdk-8-jdk \
python2.7 \
python-pip \
wget

RUN DEBIAN_FRONTEND=noninteractive pip install \
requests \
numpy

# Build script looks for javac in jre dir
ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64"

# 
http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
# We have a pretty beefy server
ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g"
{code}

> Spark Jenkins supports testing GPU-aware scheduling features
> 
>
> Key: SPARK-27365
> URL: https://issues.apache.org/jira/browse/SPARK-27365
> Project: Spark
>  Issue Type: Story
>  Components: jenkins
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Upgrade Spark Jenkins to install GPU cards and run GPU integration tests 
> triggered by "GPU" in PRs.
> cc: [~afeng] [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features

2019-04-17 Thread Calvin Park (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820549#comment-16820549
 ] 

Calvin Park edited comment on SPARK-27365 at 4/17/19 10:34 PM:
---

Simple Jenkinsfile
{code:java}
pipeline {
agent {
dockerfile {
label 'docker-gpu'
args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all'
}
}
stages {
stage('smi') {
steps {
sh 'nvidia-smi'
}
}
}
}{code}
with Dockerfile
{code:java}
FROM ubuntu:xenial

RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
curl \
flake8 \
git-core \
openjdk-8-jdk \
python2.7 \
python-pip \
wget

RUN DEBIAN_FRONTEND=noninteractive pip install \
requests \
numpy

# Build script looks for javac in jre dir
ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64"

# 
http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
# We have a pretty beefy server
ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g"
{code}


was (Author: calvinatnvidia):
Simple Jenkinsfile
{code:java}
pipeline {
agent {
dockerfile {
label 'docker-gpu'
args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all'
}
}
stages {
stage('smi') {
steps {
sh 'nvidia-smi'
}
}
}
}{code}
with Dockerfile
{code:java}
FROM ubuntu:xenial

RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
curl \
flake8 \
git-core \
openjdk-8-jdk \
python2.7 \
python-pip \
wget

RUN DEBIAN_FRONTEND=noninteractive pip install \
requests \
numpy

# Build script looks for javac in jre dir
ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64"

# http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
# We have a pretty beefy server
ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g"
{code}

> Spark Jenkins supports testing GPU-aware scheduling features
> 
>
> Key: SPARK-27365
> URL: https://issues.apache.org/jira/browse/SPARK-27365
> Project: Spark
>  Issue Type: Story
>  Components: jenkins
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Upgrade Spark Jenkins to install GPU cards and run GPU integration tests 
> triggered by "GPU" in PRs.
> cc: [~afeng] [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27365) Spark Jenkins supports testing GPU-aware scheduling features

2019-04-17 Thread Calvin Park (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820549#comment-16820549
 ] 

Calvin Park commented on SPARK-27365:
-

Simple Jenkinsfile
{code:java}
pipeline {
agent {
dockerfile {
label 'docker-gpu'
args '--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all'
}
}
stages {
stage('smi') {
steps {
sh 'nvidia-smi'
}
}
}
}{code}
with Dockerfile
{code:java}
FROM ubuntu:xenial

RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -y \
curl \
flake8 \
git-core \
openjdk-8-jdk \
python2.7 \
python-pip \
wget

RUN DEBIAN_FRONTEND=noninteractive pip install \
requests \
numpy

# Build script looks for javac in jre dir
ENV JAVA_HOME "/usr/lib/jvm/java-8-openjdk-amd64"

# http://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
# We have a pretty beefy server
ENV MAVEN_OPTS "-Xmx20g -XX:ReservedCodeCacheSize=2g"
{code}

> Spark Jenkins supports testing GPU-aware scheduling features
> 
>
> Key: SPARK-27365
> URL: https://issues.apache.org/jira/browse/SPARK-27365
> Project: Spark
>  Issue Type: Story
>  Components: jenkins
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Upgrade Spark Jenkins to install GPU cards and run GPU integration tests 
> triggered by "GPU" in PRs.
> cc: [~afeng] [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x

2019-04-17 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820542#comment-16820542
 ] 

t oo commented on SPARK-27491:
--

cc: [~skonto] [~mpmolek] [~gschiavon] [~scrapco...@gmail.com]
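
For what it's worth, the driver status can also be fetched straight from the 
standalone master's REST endpoint, bypassing spark-class entirely (a rough 
sketch only; the host, port and submission id below just follow the example in 
the description):
{code}
// Sketch: query the standalone REST submission API directly for a driver's status.
// The host, port and driver id are placeholders taken from the example above.
import scala.io.Source

val url = "http://domainhere:6066/v1/submissions/status/driver-20190417130324-0009"
println(Source.fromURL(url).mkString)
{code}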

> SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty 
> response! therefore Airflow won't integrate with Spark 2.3.x
> --
>
> Key: SPARK-27491
> URL: https://issues.apache.org/jira/browse/SPARK-27491
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Scheduler, Spark Core, Spark Shell, Spark 
> Submit
>Affects Versions: 2.3.3
>Reporter: t oo
>Priority: Blocker
>
> This issue must have been introduced after Spark 2.1.1 as it is working in 
> that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using 
> spark standalone mode if that makes a difference.
> See below spark 2.3.3 returns empty response while 2.1.1 returns a response.
>  
> Spark 2.1.1:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + export SPARK_HOME=/home/ec2here/spark_home1
> + SPARK_HOME=/home/ec2here/spark_home1
> + '[' -z /home/ec2here/spark_home1 ']'
> + . /home/ec2here/spark_home1/bin/load-spark-env.sh
> ++ '[' -z /home/ec2here/spark_home1 ']'
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++ parent_dir=/home/ec2here/spark_home1
> ++ user_conf_dir=/home/ec2here/spark_home1/conf
> ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']'
> ++ set -a
> ++ . /home/ec2here/spark_home1/conf/spark-env.sh
> +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
> +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
>  ulimit -n 1048576
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10
> ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]]
> ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']'
> + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
> + '[' -d /home/ec2here/spark_home1/jars ']'
> + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars
> + '[' '!' -d /home/ec2here/spark_home1/jars ']'
> + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*'
> + '[' -n '' ']'
> + [[ -n '' ]]
> + CMD=()
> + IFS=
> + read -d '' -r ARG
> ++ build_command org.apache.spark.deploy.SparkSubmit --master 
> spark://domainhere:6066 --status driver-20190417130324-0009
> ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp 
> '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> ++ printf '%d\0' 0
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + COUNT=10
> + LAST=9
> + LAUNCHER_EXIT_CODE=0
> + [[ 0 =~ ^[0-9]+$ ]]
> + '[' 0 '!=' 0 ']'
> + CMD=("${CMD[@]:0:$LAST}")
> + exec /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp 
> '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20190417130324-0009 in spark://domainhere:6066.
> 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with 
> SubmissionStatusResponse:
> {
>  "action" : "SubmissionStatusResponse",
>  "driverState" : "FAILED",
>  "serverSparkVersion" : "2.3.3",
>  "submissionId" : "driver-20190417130324-0009",
>  "success" : true,
>  "workerHostPort" : "x.y.211.40:11819",
>  "workerId" : "worker-20190417115840-x.y.211.40-11819"
> }
> [ec2here@ip-x-y-160-225 ~]$
>  
> Spark 2.3.3:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + '[' -z '' ']'
> ++ dirname 

[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct

2019-04-17 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820525#comment-16820525
 ] 

shahid commented on SPARK-27468:


[~zsxwing] Thanks

> "Storage Level" in "RDD Storage Page" is not correct
> 
>
> Key: SPARK-27468
> URL: https://issues.apache.org/jira/browse/SPARK-27468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Priority: Major
> Attachments: Screenshot from 2019-04-17 10-42-55.png
>
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
> page.
> I tried to debug and found this is because Spark emitted the following two 
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1 
> replicas" comes from this line: 
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about is when replicas is 2, will two Spark 
> events arrive in the same order? Currently, two RPCs from different executors 
> can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27496) RPC should send back the fatal errors

2019-04-17 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27496:


 Summary: RPC should send back the fatal errors
 Key: SPARK-27496
 URL: https://issues.apache.org/jira/browse/SPARK-27496
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.1
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now, when a fatal error is thrown from "receiveAndReply", the sender is 
not notified. We should try our best to send the error back.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27434) memory leak in spark driver

2019-04-17 Thread Ryne Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820516#comment-16820516
 ] 

Ryne Yang commented on SPARK-27434:
---

[~shahid]  

were you able to reproduce this by the steps I provided? 

> memory leak in spark driver
> ---
>
> Key: SPARK-27434
> URL: https://issues.apache.org/jira/browse/SPARK-27434
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: OS: Centos 7
> JVM: 
> **_openjdk version "1.8.0_201"_
> _OpenJDK Runtime Environment (IcedTea 3.11.0) (Alpine 8.201.08-r0)_
> _OpenJDK 64-Bit Server VM (build 25.201-b08, mixed mode)_
> Spark version: 2.4.0
>Reporter: Ryne Yang
>Priority: Major
> Attachments: Screen Shot 2019-04-10 at 12.11.35 PM.png
>
>
> we got an OOM exception on the driver after it had completed multiple jobs (we 
> are reusing the Spark context). 
> so we took a heap dump and looked at the leak analysis, and found that 3.5 GB 
> of heap is allocated under AsyncEventQueue. Possibly a leak. 
>  
> can someone take a look? 
> here is the heap analysis: 
> !Screen Shot 2019-04-10 at 12.11.35 PM.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27473) Support filter push down for status fields in binary file data source

2019-04-17 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820500#comment-16820500
 ] 

Xiangrui Meng commented on SPARK-27473:
---

Given SPARK-25558 is WIP, we might want to flatten the status column to support 
filter push down.
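
If the status fields were flattened, the user-facing call might look roughly 
like the sketch below (illustrative only; the top-level `length` column is an 
assumption, nothing is committed yet):
{code}
// Illustrative sketch: a size filter that could be pushed down if the file status
// fields were flattened into top-level columns. The column name "length" is an
// assumption, not an agreed API. Assumes spark-shell, where the $-syntax is imported.
val smallFiles = spark.read.format("binaryFile")
  .load("/tmp/binary-files")
  .filter($"length" < 100000000L)  // keep files under 1e8 bytes; larger ones should be skipped at scan time
{code}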

> Support filter push down for status fields in binary file data source
> -
>
> Key: SPARK-27473
> URL: https://issues.apache.org/jira/browse/SPARK-27473
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> As a user, I can use 
> `spark.read.format("binaryFile").load(path).filter($"status.length" < 
> 1L)` to load files that are less than 1e8 bytes. Spark shouldn't even 
> read files that are bigger than 1e8 bytes in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27473) Support filter push down for status fields in binary file data source

2019-04-17 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27473:
-

Assignee: Weichen Xu

> Support filter push down for status fields in binary file data source
> -
>
> Key: SPARK-27473
> URL: https://issues.apache.org/jira/browse/SPARK-27473
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> As a user, I can use 
> `spark.read.format("binaryFile").load(path).filter($"status.length" < 
> 1L)` to load files that are less than 1e8 bytes. Spark shouldn't even 
> read files that are bigger than 1e8 bytes in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27495) Support Stage level resource scheduling

2019-04-17 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-27495:
-

 Summary: Support Stage level resource scheduling
 Key: SPARK-27495
 URL: https://issues.apache.org/jira/browse/SPARK-27495
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves


Currently Spark supports CPU-level scheduling, and we are adding accelerator-aware 
scheduling with https://issues.apache.org/jira/browse/SPARK-24615, but both of 
those are configured at the application level.  This means there is one 
configuration that is set for the entire lifetime of the application, and the 
user can't change it between Spark jobs/stages within that application.  

Many times users have different requirements for different stages of their 
application so they want to be able to configure at the stage level what 
resources are required for that stage.

For example, I might start a Spark application which first does some ETL work 
that needs lots of cores to run many tasks in parallel; then, once that is done, 
I want to run an ML job, and at that point I want GPUs, fewer CPUs, and more 
memory.

With this Jira we want to add the ability for users to specify the resources 
for different stages.

Note that https://issues.apache.org/jira/browse/SPARK-24615 had some 
discussions on this but this part of it was removed from that.

We should come up with a proposal on how to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27495) Support Stage level resource configuration and scheduling

2019-04-17 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27495:
--
Summary: Support Stage level resource configuration and scheduling  (was: 
Support Stage level resource scheduling)

> Support Stage level resource configuration and scheduling
> -
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> Currently Spark supports CPU-level scheduling, and we are adding 
> accelerator-aware scheduling with 
> https://issues.apache.org/jira/browse/SPARK-24615, but both of those are 
> configured at the application level.  This means there is one configuration 
> that is set for the entire lifetime of the application, and the user can't 
> change it between Spark jobs/stages within that application.  
> Many times users have different requirements for different stages of their 
> application so they want to be able to configure at the stage level what 
> resources are required for that stage.
> For example, I might start a Spark application which first does some ETL work 
> that needs lots of cores to run many tasks in parallel; then, once that is 
> done, I want to run an ML job, and at that point I want GPUs, fewer CPUs, and 
> more memory.
> With this Jira we want to add the ability for users to specify the resources 
> for different stages.
> Note that https://issues.apache.org/jira/browse/SPARK-24615 had some 
> discussions on this but this part of it was removed from that.
> We should come up with a proposal on how to do this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27468) "Storage Level" in "RDD Storage Page" is not correct

2019-04-17 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820460#comment-16820460
 ] 

Shixiong Zhu commented on SPARK-27468:
--

[~shahid] You need to use "--master local-cluster[2,1,1024]". The local mode 
has only one BlockManager.
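
One way the listener could derive the replica count itself, instead of trusting 
the storage level in the most recent event, is sketched below (only an 
illustration of the idea in the description, not a proposed patch):
{code}
// Sketch: derive an RDD block's replica count from block-update events by tracking
// which executors currently report a valid copy, rather than reading the
// replication field of the last StorageLevel received.
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerBlockUpdated}

class ReplicaCountingListener extends SparkListener {
  // block name -> executors currently holding a valid copy of the block
  private val holders = mutable.Map.empty[String, mutable.Set[String]]

  override def onBlockUpdated(event: SparkListenerBlockUpdated): Unit = synchronized {
    val info = event.blockUpdatedInfo
    val set = holders.getOrElseUpdate(info.blockId.name, mutable.Set.empty[String])
    if (info.storageLevel.isValid) set += info.blockManagerId.executorId
    else set -= info.blockManagerId.executorId
  }

  def replicas(blockName: String): Int = synchronized {
    holders.get(blockName).map(_.size).getOrElse(0)
  }
}

// Usage: sc.addSparkListener(new ReplicaCountingListener); later query replicas("rdd_0_0").
{code}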

> "Storage Level" in "RDD Storage Page" is not correct
> 
>
> Key: SPARK-27468
> URL: https://issues.apache.org/jira/browse/SPARK-27468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Shixiong Zhu
>Priority: Major
> Attachments: Screenshot from 2019-04-17 10-42-55.png
>
>
> I ran the following unit test and checked the UI.
> {code}
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local-cluster[2,1,1024]")
>   .set("spark.ui.enabled", "true")
> sc = new SparkContext(conf)
> val rdd = sc.makeRDD(1 to 10, 1).persist(StorageLevel.MEMORY_ONLY_2)
> rdd.count()
> Thread.sleep(360)
> {code}
> The storage level is "Memory Deserialized 1x Replicated" in the RDD storage 
> page.
> I tried to debug and found this is because Spark emitted the following two 
> events:
> {code}
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(1, 
> 10.8.132.160, 65473, None),rdd_0_0,StorageLevel(memory, deserialized, 2 
> replicas),56,0))
> event: SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(0, 
> 10.8.132.160, 65474, None),rdd_0_0,StorageLevel(memory, deserialized, 1 
> replicas),56,0))
> {code}
> The storage level in the second event will overwrite the first one. "1 
> replicas" comes from this line: 
> https://github.com/apache/spark/blob/3ab96d7acf870e53c9016b0b63d0b328eec23bed/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1457
> Maybe AppStatusListener should calculate the replicas from events?
> Another fact we may need to think about is when replicas is 2, will two Spark 
> events arrive in the same order? Currently, two RPCs from different executors 
> can arrive in any order.
> Credit goes to [~srfnmnk] who reported this issue originally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27494) Null values don't work in Kafka source v2

2019-04-17 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-27494:


 Summary: Null values don't work in Kafka source v2
 Key: SPARK-27494
 URL: https://issues.apache.org/jira/browse/SPARK-27494
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.1
Reporter: Shixiong Zhu


Right now Kafka source v2 doesn't support null values. The issue is in 
org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow which 
doesn't handle null values.
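
A minimal repro sketch (not from the report; it assumes a local Kafka broker at 
localhost:9092 and a topic "t" that already contains at least one record with a 
null value, e.g. a log-compaction tombstone):
{code}
// Sketch of a repro. Broker address and topic name are assumptions.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "t")
  .load()

// With the v2 source, processing the null-valued record is expected to fail
// inside KafkaRecordToUnsafeRowConverter.toUnsafeRow.
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
{code}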



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27493) Upgrade ASM to 7.1

2019-04-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27493:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-24417

> Upgrade ASM to 7.1
> --
>
> Key: SPARK-27493
> URL: https://issues.apache.org/jira/browse/SPARK-27493
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> SPARK-25946 upgrades ASM to 7.0 to support JDK11. This PR aims to update ASM 
> to 7.1 to bring the bug fixes.
> - https://asm.ow2.io/versions.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27493) Upgrade ASM to 7.1

2019-04-17 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27493:
-

 Summary: Upgrade ASM to 7.1
 Key: SPARK-27493
 URL: https://issues.apache.org/jira/browse/SPARK-27493
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


SPARK-25946 upgrades ASM to 7.0 to support JDK11. This PR aims to update ASM to 
7.1 to bring the bug fixes.

- https://asm.ow2.io/versions.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27276) Increase the minimum pyarrow version to 0.12.1

2019-04-17 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820370#comment-16820370
 ] 

shane knapp commented on SPARK-27276:
-

test currently running:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/104671/

> Increase the minimum pyarrow version to 0.12.1
> --
>
> Key: SPARK-27276
> URL: https://issues.apache.org/jira/browse/SPARK-27276
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> The current minimum version is 0.8.0, which is pretty ancient since Arrow has 
> been moving fast and a lot has changed since this version. There are 
> currently many workarounds checking for different versions or disabling 
> specific functionality, and the code is getting ugly and difficult to 
> maintain. Increasing the version will allow cleanup and upgrade the testing 
> environment.
> This involves changing the pyarrow version in setup.py (currently at 0.8.0), 
> updating Jenkins to test against the new version, code cleanup to remove 
> workarounds from older versions.  Newer versions of pyarrow have dropped 
> support for Python 3.4, so it might be necessary to update to Python 3.5+ in 
> Jenkins as well. Users would then need to ensure at least this version of 
> pyarrow is installed on the cluster.
> There is also a 0.12.1 release, so I will need to check what bugs that fixed 
> to see if that will be a better version.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6

2019-04-17 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820364#comment-16820364
 ] 

shane knapp commented on SPARK-25079:
-

from my email to dev@:

ok.

after much wailing and gnashing of teeth (and conversations w/[~bryanc]), i 
think we're coming to a general idea of how python testing will soon work!

i propose the following:

py27: master, 2.3, 2.4
py36 + pandas 0.19.2 + pyarrow 0.8.0: 2.3, 2.4
py36 + pandas 0.24.2 + pyarrow 0.12.1: master

all of the above combinations have been tested (locally) and pass.  i will need 
to create/deploy the new 2.3/4 branch python envs and then test my two PRs 
against them.

the good:
1) this, IMO, will get us to a place where we can get all spark python tests 
using py36 as quickly as possible w/o needing to backport and spend a ton of 
time fixing 2.3/4 tests.

2) there is literally *one* hardcoded path (in dev/run-tests.py) that needs to 
be updated on 2.3/4 to point to a different python env than 'py3k'.

the bad:
1) three python envs to deal with (with the env supporting 2.3 and 2.4 
remaining relatively static).

since the 'good' definitely outweighs the 'bad', my vote is for 'good'.  ;)

also:  i am putting my foot down and we won't be testing against more than 
three python envs!



> [PYTHON] upgrade python 3.4 -> 3.6
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-27389) pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"

2019-04-17 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-27389.
---

> pyspark test failures w/ "UnknownTimeZoneError: 'US/Pacific-New'"
> -
>
> Key: SPARK-27389
> URL: https://issues.apache.org/jira/browse/SPARK-27389
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, PySpark
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Assignee: shane knapp
>Priority: Major
>
> I've seen a few odd PR build failures w/ an error in pyspark tests about 
> "UnknownTimeZoneError: 'US/Pacific-New'".  eg. 
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4688/consoleFull
> A bit of searching tells me that US/Pacific-New probably isn't really 
> supposed to be a timezone at all: 
> https://mm.icann.org/pipermail/tz/2009-February/015448.html
> I'm guessing that this is from some misconfiguration of jenkins.  That said, 
> I can't figure out what is wrong.  There does seem to be a timezone entry for 
> US/Pacific-New in {{/usr/share/zoneinfo/US/Pacific-New}} -- but it seems to 
> be there on every amp-jenkins-worker, so I don't know how that alone would 
> cause this failure only sometimes.
> [~shaneknapp] I am tentatively calling this a "jenkins" issue, but I might be 
> totally wrong here and it is really a pyspark problem.
> Full Stack trace from the test failure:
> {noformat}
> ==
> ERROR: test_to_pandas (pyspark.sql.tests.test_dataframe.DataFrameTests)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 522, in test_to_pandas
> pdf = self._to_pandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/tests/test_dataframe.py",
>  line 517, in _to_pandas
> return df.toPandas()
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/dataframe.py",
>  line 2189, in toPandas
> _check_series_convert_timestamps_local_tz(pdf[field.name], timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1891, in _check_series_convert_timestamps_local_tz
> return _check_series_convert_timestamps_localize(s, None, timezone)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1877, in _check_series_convert_timestamps_localize
> lambda ts: ts.tz_localize(from_tz, 
> ambiguous=False).tz_convert(to_tz).tz_localize(None)
>   File "/home/anaconda/lib/python2.7/site-packages/pandas/core/series.py", 
> line 2294, in apply
> mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/src/inference.pyx", line 1207, in pandas.lib.map_infer 
> (pandas/lib.c:66124)
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder@2/python/pyspark/sql/types.py",
>  line 1878, in 
> if ts is not pd.NaT else pd.NaT)
>   File "pandas/tslib.pyx", line 649, in pandas.tslib.Timestamp.tz_convert 
> (pandas/tslib.c:13923)
>   File "pandas/tslib.pyx", line 407, in pandas.tslib.Timestamp.__new__ 
> (pandas/tslib.c:10447)
>   File "pandas/tslib.pyx", line 1467, in pandas.tslib.convert_to_tsobject 
> (pandas/tslib.c:27504)
>   File "pandas/tslib.pyx", line 1768, in pandas.tslib.maybe_get_tz 
> (pandas/tslib.c:32362)
>   File "/home/anaconda/lib/python2.7/site-packages/pytz/__init__.py", line 
> 178, in timezone
> raise UnknownTimeZoneError(zone)
> UnknownTimeZoneError: 'US/Pacific-New'
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25088) Rest Server default & doc updates

2019-04-17 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820321#comment-16820321
 ] 

Imran Rashid commented on SPARK-25088:
--

if you're allowing unauthenticated REST, what is the point of auth on standard 
submission?  Most users would just think they had a secure setup with auth on 
standard submission, and not realize they'd left a backdoor wide open.  It's not 
worth that security risk.

> Rest Server default & doc updates
> -
>
> Key: SPARK-25088
> URL: https://issues.apache.org/jira/browse/SPARK-25088
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> The rest server could use some updates on defaults & docs, both in standalone 
> and mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-04-17 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820309#comment-16820309
 ] 

Imran Rashid commented on SPARK-6235:
-

[~glal14] actually this was fixed in 2.4.  There was one open issue, 
SPARK-24936, but I just closed that as it's just improving an error message, 
which I don't think is really worth fixing just for Spark 3.0, so I also 
resolved this umbrella.

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6235) Address various 2G limits

2019-04-17 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-6235.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24936) Better error message when trying a shuffle fetch over 2 GB

2019-04-17 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-24936.
--
Resolution: Won't Fix

As we've already shipped 2.4, I think it's unlikely we're going to fix this 
later.  I don't think we need to worry that much about Spark 3.0 talking to 
shuffle services < 2.2.

If anybody is motivated, feel free to submit a PR here, but I think leaving 
this open is probably misleading about the status.

> Better error message when trying a shuffle fetch over 2 GB
> --
>
> Key: SPARK-24936
> URL: https://issues.apache.org/jira/browse/SPARK-24936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>
> After SPARK-24297, Spark will try to fetch shuffle blocks to disk if they're 
> over 2 GB.  However, this will fail with an external shuffle service running < 
> Spark 2.2, with an unhelpful error message like:
> {noformat}
> 18/07/26 07:15:02 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.3 
> (TID 15, xyz.com, executor 2): FetchFailed(BlockManagerId(1
> , xyz.com, 7337, None), shuffleId=0, mapId=1, reduceId=1, message=
> org.apache.spark.shuffle.FetchFailedException: 
> java.lang.UnsupportedOperationException
> at 
> org.apache.spark.network.server.StreamManager.openStream(StreamManager.java:60)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
> ...
> {noformat}
> We can't do anything to make the shuffle succeed in this situation, but we 
> should fail with a better error message.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27492) High level user documentation

2019-04-17 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-27492:
-

 Summary: High level user documentation
 Key: SPARK-27492
 URL: https://issues.apache.org/jira/browse/SPARK-27492
 Project: Spark
  Issue Type: Story
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Thomas Graves


Add some high-level user documentation about how these features work together, 
and point to things like the example discovery script, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27403) Fix `updateTableStats` to update table stats always with new stats or None

2019-04-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27403:
--
Summary: Fix `updateTableStats` to update table stats always with new stats 
or None  (was: Failed to update the table size automatically even though 
spark.sql.statistics.size.autoUpdate.enabled is set as true)

> Fix `updateTableStats` to update table stats always with new stats or None
> --
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Sujith Chacko
>Assignee: Sujith Chacko
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> The system should update the table stats automatically if the user sets 
> spark.sql.statistics.size.autoUpdate.enabled to true; currently this property 
> has no effect whether it is enabled or disabled. This feature is similar to 
> Hive's auto-gather feature, where statistics are automatically computed by 
> default when that feature is enabled.
> Reference:
> [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
> Reproducing steps:
> scala> spark.sql("create table table1 (name string,age int) stored as 
> parquet")
> scala> spark.sql("insert into table1 select 'a',29")
>  res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
>  
> +---+---++---
> |col_name|data_type|comment|
> +---+---++---
> |name|string|null|
> |age|int|null|
> | | | |
> | # Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| 
> |
> |Storage Properties|[serialization.format=1]| |
> |Partition Provider|Catalog| |
> +---+---++---



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27403) Failed to update the table size automatically even though spark.sql.statistics.size.autoUpdate.enabled is set as true

2019-04-17 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27403:
--
Fix Version/s: 2.4.2

> Failed to update the table size automatically even though 
> spark.sql.statistics.size.autoUpdate.enabled is set as true
> 
>
> Key: SPARK-27403
> URL: https://issues.apache.org/jira/browse/SPARK-27403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1
>Reporter: Sujith Chacko
>Assignee: Sujith Chacko
>Priority: Major
> Fix For: 2.4.2, 3.0.0
>
>
> The system should update the table stats automatically if the user sets 
> spark.sql.statistics.size.autoUpdate.enabled to true; currently this property 
> has no effect whether it is enabled or disabled. This feature is similar to 
> Hive's auto-gather feature, where statistics are automatically computed by 
> default when that feature is enabled.
> Reference:
> [https://cwiki.apache.org/confluence/display/Hive/StatsDev]
> Reproducing steps:
> scala> spark.sql("create table table1 (name string,age int) stored as 
> parquet")
> scala> spark.sql("insert into table1 select 'a',29")
>  res2: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("desc extended table1").show(false)
>  
> +---+---++---
> |col_name|data_type|comment|
> +---+---++---
> |name|string|null|
> |age|int|null|
> | | | |
> | # Detailed Table Information| | |
> |Database|default| |
> |Table|table1| |
> |Owner|Administrator| |
> |Created Time|Sun Apr 07 23:41:56 IST 2019| |
> |Last Access|Thu Jan 01 05:30:00 IST 1970| |
> |Created By|Spark 2.4.1| |
> |Type|MANAGED| |
> |Provider|hive| |
> |Table Properties|[transient_lastDdlTime=1554660716]| |
> |Location|file:/D:/spark-2.4.1-bin-hadoop2.7/bin/spark-warehouse/table1| |
> |Serde Library|org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe| |
> |InputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat| |
> |OutputFormat|org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| 
> |
> |Storage Properties|[serialization.format=1]| |
> |Partition Provider|Catalog| |
> +---+---++---



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-04-17 Thread Gowtam Lal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820229#comment-16820229
 ] 

Gowtam Lal commented on SPARK-6235:
---

It would be great to see this go out. Any updates?

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x

2019-04-17 Thread t oo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

t oo updated SPARK-27491:
-
Summary: SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" 
returns empty response! therefore Airflow won't integrate with Spark 2.3.x  
(was: SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns 
empty response! therefore Airflow won't integrate with Spark)

> SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty 
> response! therefore Airflow won't integrate with Spark 2.3.x
> --
>
> Key: SPARK-27491
> URL: https://issues.apache.org/jira/browse/SPARK-27491
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Scheduler, Spark Core, Spark Shell, Spark 
> Submit
>Affects Versions: 2.3.3
>Reporter: t oo
>Priority: Blocker
>
> This issue must have been introduced after Spark 2.1.1 as it is working in 
> that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using 
> spark standalone mode if that makes a difference.
> See below spark 2.3.3 returns empty response while 2.1.1 returns a response.
>  
> Spark 2.1.1:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + export SPARK_HOME=/home/ec2here/spark_home1
> + SPARK_HOME=/home/ec2here/spark_home1
> + '[' -z /home/ec2here/spark_home1 ']'
> + . /home/ec2here/spark_home1/bin/load-spark-env.sh
> ++ '[' -z /home/ec2here/spark_home1 ']'
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++ parent_dir=/home/ec2here/spark_home1
> ++ user_conf_dir=/home/ec2here/spark_home1/conf
> ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']'
> ++ set -a
> ++ . /home/ec2here/spark_home1/conf/spark-env.sh
> +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
> +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
>  ulimit -n 1048576
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10
> ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]]
> ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']'
> + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
> + '[' -d /home/ec2here/spark_home1/jars ']'
> + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars
> + '[' '!' -d /home/ec2here/spark_home1/jars ']'
> + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*'
> + '[' -n '' ']'
> + [[ -n '' ]]
> + CMD=()
> + IFS=
> + read -d '' -r ARG
> ++ build_command org.apache.spark.deploy.SparkSubmit --master 
> spark://domainhere:6066 --status driver-20190417130324-0009
> ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp 
> '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> ++ printf '%d\0' 0
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + COUNT=10
> + LAST=9
> + LAUNCHER_EXIT_CODE=0
> + [[ 0 =~ ^[0-9]+$ ]]
> + '[' 0 '!=' 0 ']'
> + CMD=("${CMD[@]:0:$LAST}")
> + exec /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp 
> '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20190417130324-0009 in spark://domainhere:6066.
> 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with 
> SubmissionStatusResponse:
> {
>  "action" : "SubmissionStatusResponse",
>  "driverState" : "FAILED",
>  "serverSparkVersion" : "2.3.3",
>  "submissionId" : "driver-20190417130324-0009",
>  "success" : true,
>  "workerHostPort" : "x.y.211.40:11819",
>  "workerId" : "worker-20190417115840-x.y.211.40-11819"
> }
> [ec2here@ip-x-y-160-225 ~]$
>  
> Spark 2.3.3:
> [ec2here@ip-x-y-160-225 ~]$ bash -x 

[jira] [Commented] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark 2.3.x

2019-04-17 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820211#comment-16820211
 ] 

t oo commented on SPARK-27491:
--

cc: [~ash] [~bolke]

> SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty 
> response! therefore Airflow won't integrate with Spark 2.3.x
> --
>
> Key: SPARK-27491
> URL: https://issues.apache.org/jira/browse/SPARK-27491
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Scheduler, Spark Core, Spark Shell, Spark 
> Submit
>Affects Versions: 2.3.3
>Reporter: t oo
>Priority: Blocker
>
> This issue must have been introduced after Spark 2.1.1 as it is working in 
> that version. This issue is affecting me in Spark 2.3.3/2.3.0. I am using 
> spark standalone mode if that makes a difference.
> See below spark 2.3.3 returns empty response while 2.1.1 returns a response.
>  
> Spark 2.1.1:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + export SPARK_HOME=/home/ec2here/spark_home1
> + SPARK_HOME=/home/ec2here/spark_home1
> + '[' -z /home/ec2here/spark_home1 ']'
> + . /home/ec2here/spark_home1/bin/load-spark-env.sh
> ++ '[' -z /home/ec2here/spark_home1 ']'
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
> ++ parent_dir=/home/ec2here/spark_home1
> ++ user_conf_dir=/home/ec2here/spark_home1/conf
> ++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']'
> ++ set -a
> ++ . /home/ec2here/spark_home1/conf/spark-env.sh
> +++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
> +++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
>  ulimit -n 1048576
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10
> ++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]]
> ++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']'
> + RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
> + '[' -d /home/ec2here/spark_home1/jars ']'
> + SPARK_JARS_DIR=/home/ec2here/spark_home1/jars
> + '[' '!' -d /home/ec2here/spark_home1/jars ']'
> + LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*'
> + '[' -n '' ']'
> + [[ -n '' ]]
> + CMD=()
> + IFS=
> + read -d '' -r ARG
> ++ build_command org.apache.spark.deploy.SparkSubmit --master 
> spark://domainhere:6066 --status driver-20190417130324-0009
> ++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp 
> '/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> ++ printf '%d\0' 0
> + CMD+=("$ARG")
> + IFS=
> + read -d '' -r ARG
> + COUNT=10
> + LAST=9
> + LAUNCHER_EXIT_CODE=0
> + [[ 0 =~ ^[0-9]+$ ]]
> + '[' 0 '!=' 0 ']'
> + CMD=("${CMD[@]:0:$LAST}")
> + exec /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp 
> '/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20190417130324-0009 in spark://domainhere:6066.
> 19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with 
> SubmissionStatusResponse:
> {
>  "action" : "SubmissionStatusResponse",
>  "driverState" : "FAILED",
>  "serverSparkVersion" : "2.3.3",
>  "submissionId" : "driver-20190417130324-0009",
>  "success" : true,
>  "workerHostPort" : "x.y.211.40:11819",
>  "workerId" : "worker-20190417115840-x.y.211.40-11819"
> }
> [ec2here@ip-x-y-160-225 ~]$
>  
> Spark 2.3.3:
> [ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class 
> org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
> driver-20190417130324-0009
> + '[' -z '' ']'
> ++ dirname /home/ec2here/spark_home/bin/spark-class
> + source 

[jira] [Created] (SPARK-27491) SPARK REST API - "org.apache.spark.deploy.SparkSubmit --status" returns empty response! therefore Airflow won't integrate with Spark

2019-04-17 Thread t oo (JIRA)
t oo created SPARK-27491:


 Summary: SPARK REST API - "org.apache.spark.deploy.SparkSubmit 
--status" returns empty response! therefore Airflow won't integrate with Spark
 Key: SPARK-27491
 URL: https://issues.apache.org/jira/browse/SPARK-27491
 Project: Spark
  Issue Type: Bug
  Components: Java API, Scheduler, Spark Core, Spark Shell, Spark Submit
Affects Versions: 2.3.3
Reporter: t oo


This issue must have been introduced after Spark 2.1.1, as it works in that 
version. It affects me in Spark 2.3.3/2.3.0. I am using Spark standalone mode, 
if that makes a difference.

See below: Spark 2.3.3 returns an empty response while 2.1.1 returns a response.

 

Spark 2.1.1:

[ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home1/bin/spark-class 
org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
driver-20190417130324-0009
+ export SPARK_HOME=/home/ec2here/spark_home1
+ SPARK_HOME=/home/ec2here/spark_home1
+ '[' -z /home/ec2here/spark_home1 ']'
+ . /home/ec2here/spark_home1/bin/load-spark-env.sh
++ '[' -z /home/ec2here/spark_home1 ']'
++ '[' -z '' ']'
++ export SPARK_ENV_LOADED=1
++ SPARK_ENV_LOADED=1
++ parent_dir=/home/ec2here/spark_home1
++ user_conf_dir=/home/ec2here/spark_home1/conf
++ '[' -f /home/ec2here/spark_home1/conf/spark-env.sh ']'
++ set -a
++ . /home/ec2here/spark_home1/conf/spark-env.sh
+++ export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
+++ JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64
 ulimit -n 1048576
++ set +a
++ '[' -z '' ']'
++ ASSEMBLY_DIR2=/home/ec2here/spark_home1/assembly/target/scala-2.11
++ ASSEMBLY_DIR1=/home/ec2here/spark_home1/assembly/target/scala-2.10
++ [[ -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ]]
++ '[' -d /home/ec2here/spark_home1/assembly/target/scala-2.11 ']'
++ export SPARK_SCALA_VERSION=2.10
++ SPARK_SCALA_VERSION=2.10
+ '[' -n /usr/lib/jvm/jre-1.8.0-openjdk.x86_64 ']'
+ RUNNER=/usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java
+ '[' -d /home/ec2here/spark_home1/jars ']'
+ SPARK_JARS_DIR=/home/ec2here/spark_home1/jars
+ '[' '!' -d /home/ec2here/spark_home1/jars ']'
+ LAUNCH_CLASSPATH='/home/ec2here/spark_home1/jars/*'
+ '[' -n '' ']'
+ [[ -n '' ]]
+ CMD=()
+ IFS=
+ read -d '' -r ARG
++ build_command org.apache.spark.deploy.SparkSubmit --master 
spark://domainhere:6066 --status driver-20190417130324-0009
++ /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -Xmx128m -cp 
'/home/ec2here/spark_home1/jars/*' org.apache.spark.launcher.Main 
org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
driver-20190417130324-0009
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
++ printf '%d\0' 0
+ CMD+=("$ARG")
+ IFS=
+ read -d '' -r ARG
+ COUNT=10
+ LAST=9
+ LAUNCHER_EXIT_CODE=0
+ [[ 0 =~ ^[0-9]+$ ]]
+ '[' 0 '!=' 0 ']'
+ CMD=("${CMD[@]:0:$LAST}")
+ exec /usr/lib/jvm/jre-1.8.0-openjdk.x86_64/bin/java -cp 
'/home/ec2here/spark_home1/conf/:/home/ec2here/spark_home1/jars/*' -Xmx2048m 
org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
driver-20190417130324-0009
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/04/17 14:03:27 INFO RestSubmissionClient: Submitting a request for the 
status of submission driver-20190417130324-0009 in spark://domainhere:6066.
19/04/17 14:03:28 INFO RestSubmissionClient: Server responded with 
SubmissionStatusResponse:
{
 "action" : "SubmissionStatusResponse",
 "driverState" : "FAILED",
 "serverSparkVersion" : "2.3.3",
 "submissionId" : "driver-20190417130324-0009",
 "success" : true,
 "workerHostPort" : "x.y.211.40:11819",
 "workerId" : "worker-20190417115840-x.y.211.40-11819"
}
[ec2here@ip-x-y-160-225 ~]$

 


Spark 2.3.3:


[ec2here@ip-x-y-160-225 ~]$ bash -x /home/ec2here/spark_home/bin/spark-class 
org.apache.spark.deploy.SparkSubmit --master spark://domainhere:6066 --status 
driver-20190417130324-0009
+ '[' -z '' ']'
++ dirname /home/ec2here/spark_home/bin/spark-class
+ source /home/ec2here/spark_home/bin/find-spark-home
 dirname /home/ec2here/spark_home/bin/spark-class
+++ cd /home/ec2here/spark_home/bin
+++ pwd
++ FIND_SPARK_HOME_PYTHON_SCRIPT=/home/ec2here/spark_home/bin/find_spark_home.py
++ '[' '!' -z '' ']'
++ '[' '!' -f /home/ec2here/spark_home/bin/find_spark_home.py ']'
 dirname /home/ec2here/spark_home/bin/spark-class
+++ cd /home/ec2here/spark_home/bin/..
+++ pwd
++ export SPARK_HOME=/home/ec2here/spark_home
++ SPARK_HOME=/home/ec2here/spark_home
+ . /home/ec2here/spark_home/bin/load-spark-env.sh
++ '[' -z /home/ec2here/spark_home ']'
++ 

[jira] [Commented] (SPARK-27485) Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1

2019-04-17 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820189#comment-16820189
 ] 

shahid commented on SPARK-27485:


Could you please share a test to reproduce this?
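
For reference, the setting in question is disabled like below. This is only a minimal
setup sketch, not a confirmed reproducer; the exact query shape that triggers the
None.get in EnsureRequirements is not included in the report:

{code:scala}
// Minimal setup sketch: disable broadcast joins so the sort-merge join path
// (and EnsureRequirements key reordering) is exercised.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.range(100).selectExpr("id AS a", "id % 10 AS b")
val right = spark.range(100).selectExpr("id AS a", "id % 10 AS b")

// A multi-column equi-join goes through EnsureRequirements.reorderJoinKeys.
left.join(right, Seq("a", "b")).explain()
{code}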

> Certain query plans fail to run when autoBroadcastJoinThreshold is set to -1
> 
>
> Key: SPARK-27485
> URL: https://issues.apache.org/jira/browse/SPARK-27485
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.4.0
>Reporter: Muthu Jayakumar
>Priority: Minor
>
> Certain queries fail with
> {noformat}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:349)
>   at scala.None$.get(Option.scala:347)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1(EnsureRequirements.scala:238)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.$anonfun$reorder$1$adapted(EnsureRequirements.scala:233)
>   at scala.collection.immutable.List.foreach(List.scala:388)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorder(EnsureRequirements.scala:233)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.reorderJoinKeys(EnsureRequirements.scala:262)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.org$apache$spark$sql$execution$exchange$EnsureRequirements$$reorderJoinPredicates(EnsureRequirements.scala:289)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:304)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements$$anonfun$apply$1.applyOrElse(EnsureRequirements.scala:296)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$4(TreeNode.scala:282)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:282)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:326)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:296)
>   at 
> org.apache.spark.sql.execution.exchange.EnsureRequirements.apply(EnsureRequirements.scala:38)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$prepareForExecution$1(QueryExecution.scala:87)
>   at 
> 

[jira] [Created] (SPARK-27490) File source V2: return correct result for Dataset.inputFiles()

2019-04-17 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27490:
--

 Summary: File source V2: return correct result for 
Dataset.inputFiles()
 Key: SPARK-27490
 URL: https://issues.apache.org/jira/browse/SPARK-27490
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


Currently, a `Dataset` backed by file source V2 always returns an empty result 
from the method `Dataset.inputFiles()`.

We should fix it.
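
A minimal sketch of the affected call, assuming any Parquet path that is read through
the file source V2 code path (the path is a placeholder):

{code:scala}
val df = spark.read.parquet("/tmp/some_parquet_table")

// Expected: the table's underlying data files; observed with file source V2: empty.
df.inputFiles.foreach(println)
{code}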



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27489) UI updates to show executor resource information

2019-04-17 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-27489:
-

 Summary: UI updates to show executor resource information
 Key: SPARK-27489
 URL: https://issues.apache.org/jira/browse/SPARK-27489
 Project: Spark
  Issue Type: Story
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves


We are adding support for other resource types to executors and Spark. We should 
show the resource information for each executor on the Executors page of the UI.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27364) User-facing APIs for GPU-aware scheduling

2019-04-17 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reopened SPARK-27364:
---

reopening since it has a subtask

> User-facing APIs for GPU-aware scheduling
> -
>
> Key: SPARK-27364
> URL: https://issues.apache.org/jira/browse/SPARK-27364
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>
> Design and implement:
> * General guidelines for cluster managers to understand resource requests at 
> application start. The concrete conf/param is left to the design of each 
> cluster manager.
> * APIs to fetch assigned resources from task context.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27364) User-facing APIs for GPU-aware scheduling

2019-04-17 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820120#comment-16820120
 ] 

Thomas Graves commented on SPARK-27364:
---

Based on there being no comments on this, I'm going to resolve it; we can discuss 
more in the PRs for the implementation.

> User-facing APIs for GPU-aware scheduling
> -
>
> Key: SPARK-27364
> URL: https://issues.apache.org/jira/browse/SPARK-27364
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>
> Design and implement:
> * General guidelines for cluster managers to understand resource requests at 
> application start. The concrete conf/param is left to the design of each 
> cluster manager.
> * APIs to fetch assigned resources from task context.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27364) User-facing APIs for GPU-aware scheduling

2019-04-17 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-27364.
---
Resolution: Fixed

> User-facing APIs for GPU-aware scheduling
> -
>
> Key: SPARK-27364
> URL: https://issues.apache.org/jira/browse/SPARK-27364
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>
> Design and implement:
> * General guidelines for cluster managers to understand resource requests at 
> application start. The concrete conf/param is left to the design of each 
> cluster manager.
> * APIs to fetch assigned resources from task context.
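
For the second item quoted above (fetching assigned resources from the task context),
a purely hypothetical sketch of what such an API could look like; the names and shapes
here are illustrative only, since this API is exactly what the story is designing:

{code:scala}
import org.apache.spark.TaskContext

// Hypothetical API shape: a per-task view of assigned resources, keyed by
// resource name (e.g. "gpu"), exposing the addresses assigned to this task.
sc.range(0, 10, 1, 2).foreach { _ =>
  val resources = TaskContext.get().resources()
  resources.get("gpu").foreach(info => println(info.addresses.mkString(",")))
}
{code}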



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27488) Driver interface to support GPU resources

2019-04-17 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820119#comment-16820119
 ] 

Thomas Graves commented on SPARK-27488:
---

Note, the api design is here: https://issues.apache.org/jira/browse/SPARK-27364

> Driver interface to support GPU resources 
> --
>
> Key: SPARK-27488
> URL: https://issues.apache.org/jira/browse/SPARK-27488
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>
> We want to have an interface that allows users on the driver to get what 
> resources are allocated to them. This is mostly to handle the case where the 
> cluster manager does not launch the driver in an isolated environment and 
> users could be sharing hosts. For instance, standalone mode doesn't support 
> container isolation, so a host may have 4 GPUs but only 2 of them may be 
> assigned to the driver. In this case we need an interface for the cluster 
> manager to specify which GPUs the driver should use, and an interface for the 
> user to get the resource information.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27488) Driver interface to support GPU resources

2019-04-17 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-27488:
-

 Summary: Driver interface to support GPU resources 
 Key: SPARK-27488
 URL: https://issues.apache.org/jira/browse/SPARK-27488
 Project: Spark
  Issue Type: Story
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Thomas Graves
Assignee: Thomas Graves


We want to have an interface that allows users on the driver to get what 
resources are allocated to them. This is mostly to handle the case where the 
cluster manager does not launch the driver in an isolated environment and users 
could be sharing hosts. For instance, standalone mode doesn't support container 
isolation, so a host may have 4 GPUs but only 2 of them may be assigned to the 
driver. In this case we need an interface for the cluster manager to specify 
which GPUs the driver should use, and an interface for the user to get the 
resource information.
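
A purely hypothetical sketch of what such a driver-side interface could look like; the
name used here is illustrative only, since the API itself is what this story is about:

{code:scala}
// Hypothetical API shape: ask the SparkContext which resources the cluster
// manager assigned to this driver (e.g. 2 of the host's 4 GPUs).
val driverGpus = sc.resources.get("gpu")
driverGpus.foreach(info => println(s"driver GPU addresses: ${info.addresses.mkString(",")}"))
{code}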



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2019-04-17 Thread Dave DeCaprio (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820114#comment-16820114
 ] 

Dave DeCaprio commented on SPARK-23904:
---

No, it's just in master, which is the 3.X branch.

I do have backports of this and other PRs I have made related to large query 
plans in my repo: [https://github.com/DaveDeCaprio/spark] - it's the 
closedloop-2.4 branch.

 

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I 
> don't need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  
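
For illustration only (this is not the linked gist), the kind of iterative construction
that makes the plan's text representation blow up:

{code:scala}
// Iteratively unioning builds a very wide plan tree, and Spark renders its full
// text representation for listener events and the UI even when nobody reads it.
var df = spark.range(10).toDF("id")
for (_ <- 1 to 500) {
  df = df.union(spark.range(10).toDF("id"))
}
// The rendered plan description grows with every iteration.
println(df.queryExecution.toString.length)
{code}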



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27458) Remind developer using IntelliJ to update maven version

2019-04-17 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27458.
---
   Resolution: Fixed
 Assignee: William Wong
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark-website/pull/195

> Remind developer using IntelliJ to update maven version
> ---
>
> Key: SPARK-27458
> URL: https://issues.apache.org/jira/browse/SPARK-27458
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: William Wong
>Assignee: William Wong
>Priority: Minor
> Fix For: 3.0.0
>
>
> I am using IntelliJ to update a few Spark sources. I tried to follow the guide 
> at '[http://spark.apache.org/developer-tools.html]' to set up an IntelliJ 
> project for Spark. However, the project failed to build. It was due to 
> missing classes generated via antlr in the sql/catalyst project. I tried to 
> click the button 'Generate Sources and Update Folders for all Projects' but it 
> did not help; the antlr task was not triggered as expected.
> I checked the IntelliJ log file and found that it was because I had not set 
> the maven version properly, and the 'Generate Sources and Update Folders for 
> all Projects' process failed silently: 
>  
> _2019-04-14 16:05:24,796 [ 314609]   INFO -      #org.jetbrains.idea.maven - 
> [WARNING] Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion 
> failed with message:_
> _Detected Maven Version: 3.3.9 is not in the allowed range 3.6.0._
> _2019-04-14 16:05:24,813 [ 314626]   INFO -      #org.jetbrains.idea.maven - 
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
> goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M2:enforce 
> (enforce-versions) on project spark-parent_2.12: Some Enforcer rules have 
> failed. Look above for specific messages explaining why the rule failed._
> _java.lang.RuntimeException: 
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute 
> goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M2:enforce 
> (enforce-versions) on project spark-parent_2.12: Some Enforcer rules have 
> failed. Look above for specific messages explaining why the rule failed._
>  
> To be honest, failing an action silently should be considered an IntelliJ bug. 
> However, enhancing the page '[http://spark.apache.org/developer-tools.html]' 
> to remind developers to check the maven version may save new contributors 
> some time. 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23904) Big execution plan cause OOM

2019-04-17 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820058#comment-16820058
 ] 

Izek Greenfield commented on SPARK-23904:
-

[~DaveDeCaprio] Does that PR go into 2.4.1 release?

> Big execution plan cause OOM
> 
>
> Key: SPARK-23904
> URL: https://issues.apache.org/jira/browse/SPARK-23904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Izek Greenfield
>Priority: Major
>  Labels: SQL, query
>
> I created a question on 
> [StackOverflow|https://stackoverflow.com/questions/49508683/spark-physicalplandescription-string-is-to-big]
>  
> Spark creates the text representation of the query in any case, even if I 
> don't need it.
> That causes many garbage objects and unneeded GC... 
>  [Gist with code to 
> reproduce|https://gist.github.com/igreenfield/584c3336f03ba7d63e9026774eaf5e23]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan

2019-04-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19712.
-
Resolution: Fixed

Issue resolved by pull request 24331
[https://github.com/apache/spark/pull/24331]

> EXISTS and Left Semi join do not produce the same plan
> --
>
> Key: SPARK-19712
> URL: https://issues.apache.org/jira/browse/SPARK-19712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nattavut Sutyanyong
>Priority: Major
> Fix For: 3.0.0
>
>
> This problem was found during the development of SPARK-18874.
> The EXISTS form in the following query:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 
> from t3 where t1.t1b=t3.t3b)")}}
> gives the optimized plan below:
> {code}
> == Optimized Logical Plan ==
> Join Inner, (t1a#7 = t2a#25)
> :- Join LeftSemi, (t1b#8 = t3b#58)
> :  :- Filter isnotnull(t1a#7)
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Project [1 AS 1#271, t3b#58]
> : +- Relation[t3a#57,t3b#58,t3c#59] parquet
> +- Filter isnotnull(t2a#25)
>+- Relation[t2a#25,t2b#26,t2c#27] parquet
> {code}
> whereas a semantically equivalent Left Semi join query below:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on 
> t1.t1b=t3.t3b")}}
> gives the following optimized plan:
> {code}
> == Optimized Logical Plan ==
> Join LeftSemi, (t1b#8 = t3b#58)
> :- Join Inner, (t1a#7 = t2a#25)
> :  :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7))
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Filter isnotnull(t2a#25)
> : +- Relation[t2a#25,t2b#26,t2c#27] parquet
> +- Project [t3b#58]
>+- Relation[t3a#57,t3b#58,t3c#59] parquet
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan

2019-04-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19712:
---

Assignee: Dilip Biswal

> EXISTS and Left Semi join do not produce the same plan
> --
>
> Key: SPARK-19712
> URL: https://issues.apache.org/jira/browse/SPARK-19712
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Nattavut Sutyanyong
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> This problem was found during the development of SPARK-18874.
> The EXISTS form in the following query:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 
> from t3 where t1.t1b=t3.t3b)")}}
> gives the optimized plan below:
> {code}
> == Optimized Logical Plan ==
> Join Inner, (t1a#7 = t2a#25)
> :- Join LeftSemi, (t1b#8 = t3b#58)
> :  :- Filter isnotnull(t1a#7)
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Project [1 AS 1#271, t3b#58]
> : +- Relation[t3a#57,t3b#58,t3c#59] parquet
> +- Filter isnotnull(t2a#25)
>+- Relation[t2a#25,t2b#26,t2c#27] parquet
> {code}
> whereas a semantically equivalent Left Semi join query below:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on 
> t1.t1b=t3.t3b")}}
> gives the following optimized plan:
> {code}
> == Optimized Logical Plan ==
> Join LeftSemi, (t1b#8 = t3b#58)
> :- Join Inner, (t1a#7 = t2a#25)
> :  :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7))
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Filter isnotnull(t2a#25)
> : +- Relation[t2a#25,t2b#26,t2c#27] parquet
> +- Project [t3b#58]
>+- Relation[t3a#57,t3b#58,t3c#59] parquet
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2019-04-17 Thread Zhanfeng Huo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanfeng Huo updated SPARK-3438:

Comment: was deleted

(was: This is a newest PR on master with commit 
0d1cc4ae42e1f73538dd8b9b1880ca9e5b124108(Mon Sep 8 14:32:53 2014 +0530).
1,PR:https://github.com/apache/spark/pull/2320

And this is the original PR that my PR baseed on.
2,PR:https://github.com/apache/spark/pull/265/files)

> Support for accessing secured HDFS in Standalone Mode
> -
>
> Key: SPARK-3438
> URL: https://issues.apache.org/jira/browse/SPARK-3438
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.2
>Reporter: Zhanfeng Huo
>Priority: Major
>
> Access to secured HDFS is currently supported in YARN using YARN's built-in 
> security mechanism. In YARN mode, a user application is authenticated when it 
> is submitted; it then acquires delegation tokens and ships them (via 
> YARN) securely to workers.
> In Standalone mode, it would be nice to support a mechanism for 
> accessing HDFS where we rely on a single shared secret to authenticate 
> communication in the standalone cluster.
> 1. A company is running a standalone cluster.
> 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
> all Spark jobs can trust one another.
> 3. They are able to provide a Hadoop login on the driver node via a keytab or 
> kinit. They want tokens from this login to be distributed to the executors to 
> allow access to secure HDFS.
> 4. They also don't want to trust the network on the cluster. I.e. don't want 
> to allow someone to fetch HDFS tokens easily over a known protocol, without 
> authentication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2019-04-17 Thread Zhanfeng Huo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhanfeng Huo updated SPARK-3438:

Comment: was deleted

(was: test)

> Support for accessing secured HDFS in Standalone Mode
> -
>
> Key: SPARK-3438
> URL: https://issues.apache.org/jira/browse/SPARK-3438
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.2
>Reporter: Zhanfeng Huo
>Priority: Major
>
> Access to secured HDFS is currently supported in YARN using YARN's built-in 
> security mechanism. In YARN mode, a user application is authenticated when it 
> is submitted; it then acquires delegation tokens and ships them (via 
> YARN) securely to workers.
> In Standalone mode, it would be nice to support a mechanism for 
> accessing HDFS where we rely on a single shared secret to authenticate 
> communication in the standalone cluster.
> 1. A company is running a standalone cluster.
> 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
> all Spark jobs can trust one another.
> 3. They are able to provide a Hadoop login on the driver node via a keytab or 
> kinit. They want tokens from this login to be distributed to the executors to 
> allow access to secure HDFS.
> 4. They also don't want to trust the network on the cluster. I.e. don't want 
> to allow someone to fetch HDFS tokens easily over a known protocol, without 
> authentication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3438) Support for accessing secured HDFS in Standalone Mode

2019-04-17 Thread Zhanfeng Huo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820005#comment-16820005
 ] 

Zhanfeng Huo commented on SPARK-3438:
-

test

> Support for accessing secured HDFS in Standalone Mode
> -
>
> Key: SPARK-3438
> URL: https://issues.apache.org/jira/browse/SPARK-3438
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Spark Core
>Affects Versions: 1.0.2
>Reporter: Zhanfeng Huo
>Priority: Major
>
> Access to secured HDFS is currently supported in YARN using YARN's built-in 
> security mechanism. In YARN mode, a user application is authenticated when it 
> is submitted; it then acquires delegation tokens and ships them (via 
> YARN) securely to workers.
> In Standalone mode, it would be nice to support a mechanism for 
> accessing HDFS where we rely on a single shared secret to authenticate 
> communication in the standalone cluster.
> 1. A company is running a standalone cluster.
> 2. They are fine if all Spark jobs in the cluster share a global secret, i.e. 
> all Spark jobs can trust one another.
> 3. They are able to provide a Hadoop login on the driver node via a keytab or 
> kinit. They want tokens from this login to be distributed to the executors to 
> allow access to secure HDFS.
> 4. They also don't want to trust the network on the cluster. I.e. don't want 
> to allow someone to fetch HDFS tokens easily over a known protocol, without 
> authentication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27430) broadcast hint should be respected for broadcast nested loop join

2019-04-17 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27430.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24376
[https://github.com/apache/spark/pull/24376]

> broadcast hint should be respected for broadcast nested loop join
> -
>
> Key: SPARK-27430
> URL: https://issues.apache.org/jira/browse/SPARK-27430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
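
The description is empty, but for context: the hint in question can be given as below,
and with a non-equi join condition the planner falls back to a broadcast nested loop
join, which is where (per the title) the hint should be respected. A minimal sketch:

{code:scala}
import org.apache.spark.sql.functions.broadcast

val big   = spark.range(1000000L).toDF("a")
val small = spark.range(100L).toDF("b")

// A non-equi condition cannot use a hash or sort-merge join, so Spark plans a
// broadcast nested loop join; the hint asks for `small` to be the broadcast side.
big.join(broadcast(small), big("a") > small("b")).explain()
{code}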




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27487) Spark - Scala 2.12 compatibility

2019-04-17 Thread Vadym Holubnychyi (JIRA)
Vadym Holubnychyi created SPARK-27487:
-

 Summary: Spark - Scala 2.12 compatibility
 Key: SPARK-27487
 URL: https://issues.apache.org/jira/browse/SPARK-27487
 Project: Spark
  Issue Type: Bug
  Components: Build, Deploy
Affects Versions: 2.4.1
 Environment: Scala 2.12.7, Hadoop 2.7.7, Spark 2.4.1.
Reporter: Vadym Holubnychyi


Hi, I've run into one interesting problem during development. It's documented 
that Spark 2.4.1 is compatible with Scala 2.12 (a minor version is not 
specified!). So I deployed an application that was written in Scala 2.12.7 and 
got a lot of serialization errors. Later I found that Spark had been built with 
Scala 2.12.8; I switched to it, and everything works well now. Isn't it an error 
that Spark 2.4.1 doesn't support other minor versions?
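
One way to check which Scala patch release a given Spark distribution was built with,
from spark-shell (a minimal sketch):

{code:scala}
// Prints the Scala version the running Spark REPL/distribution was compiled with,
// and the Spark version itself.
println(scala.util.Properties.versionString)   // e.g. "version 2.12.8"
println(org.apache.spark.SPARK_VERSION)        // e.g. "2.4.1"
{code}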



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2019-04-17 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819844#comment-16819844
 ] 

Song Jun commented on SPARK-19842:
--

I think Constraints should be designed together with DataSource v2, and they can 
do more than this jira.

Constraints can be used for:
1. data integrity (not included in this jira)
2. query rewrites in the optimizer to gain performance (not just PK/FK; 
unique/not null are also useful)

For data integrity, we have two scenarios:
1.1 The DataSource natively supports data integrity, such as mysql/oracle and so 
on. Spark should only call the read/write API of this DataSource and do nothing 
about data integrity.
1.2 The DataSource does not support data integrity, such as csv/json/parquet and 
so on. Spark can provide data integrity for this DataSource like Hive does 
(maybe a switch can be used to turn it off), and we can discuss which kinds of 
Constraint to support.
For example, Hive supports PK/FK/UNIQUE(DISABLE RELY)/NOT NULL/DEFAULT; the NOT 
NULL ENFORCE check is implemented by adding an extra UDF, 
GenericUDFEnforceNotNullConstraint, to the 
Plan(https://issues.apache.org/jira/browse/HIVE-16605).

For optimizer query rewrites:
2.1 We can add Constraint information to CatalogTable, which is returned by the 
catalog.getTable API. The Optimizer can then use it to rewrite queries.
2.2 If we cannot get Constraint information, we can use a hint in the SQL.

Above all, we can bring the Constraint feature into the DataSource v2 design:
a) to support feature 2.1, we can add constraint information to the 
createTable/alterTable/getTable API in this 
SPIP(https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#)
b) to support data integrity, we can add a ConstraintSupport mix-in for DataSource 
v2:
  if a DataSource supports Constraints, then Spark does nothing when inserting 
data;
  if a DataSource does not support Constraints but still wants constraint checks, 
then Spark should do the check like Hive (e.g. for not null, Hive adds an extra 
UDF, GenericUDFEnforceNotNullConstraint, to the Plan);
  if a DataSource does not support Constraints and does not want constraint 
checks, then Spark does nothing.


The Hive catalog supports constraints, so we can implement this logic in the 
createTable/alterTable API. Then we can use Spark SQL DDL to create a table with 
constraints, which are stored to the HiveMetaStore through the Hive catalog API.
For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 
PRIMARY KEY (a) DISABLE) USING parquet;

As for how to store constraints: because Hive 2.1 provides a constraint API in 
Hive.java, we can call it directly in the createTable/alterTable API of the Hive 
catalog. There is no need for Spark to store this constraint information in 
table properties. There are some concerns about using the Hive 2.1 catalog API 
directly in the 
docs(https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9),
 such as Spark's built-in Hive being 1.2.1, but upgrading Hive to 2.3.4 is in 
progress(https://issues.apache.org/jira/browse/SPARK-23710).

[~cloud_fan] [~ioana-delaney]
If this proposal is reasonable, please give me some feedback. Thanks!
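
To make the optimizer angle concrete, a small runnable illustration of the
join-elimination rewrite that an informational PK/FK pair would enable; the constraints
themselves (and the rewrite) are the hypothetical part, and the tables are placeholders:

{code:scala}
// Placeholder tables: dim.d_id is unique, fact.d_id always references it.
spark.range(100).selectExpr("id AS d_id").write.saveAsTable("dim")
spark.range(1000).selectExpr("id AS f_id", "id % 100 AS d_id").write.saveAsTable("fact")

// Q1: only fact columns are selected; the join is there purely to "check" dim.
val q1 = spark.sql("SELECT f.f_id FROM fact f JOIN dim d ON f.d_id = d.d_id")

// If d_id were declared an informational PRIMARY KEY of dim and fact.d_id a NOT NULL
// FOREIGN KEY referencing it, every fact row would match exactly one dim row, so an
// optimizer aware of the constraints could rewrite Q1 into Q2 and drop the join.
val q2 = spark.sql("SELECT f_id FROM fact")
{code}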

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint 

[jira] [Commented] (SPARK-27409) Micro-batch support for Kafka Source in Spark 2.3

2019-04-17 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819809#comment-16819809
 ] 

Gabor Somogyi commented on SPARK-27409:
---

I mean, does this cause any data processing issue other than the stack trace?
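
For context, a minimal sketch of the kind of micro-batch Kafka read the report is about
(broker, topic and checkpoint path are placeholders; no SSL options are set):

{code:scala}
val kafkaStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "some-topic")
  .load()

// Write the stream to the console sink in micro-batch mode.
kafkaStream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/kafka-console-checkpoint")
  .start()
{code}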

> Micro-batch support for Kafka Source in Spark 2.3
> -
>
> Key: SPARK-27409
> URL: https://issues.apache.org/jira/browse/SPARK-27409
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.3.2
>Reporter: Prabhjot Singh Bharaj
>Priority: Major
>
> It seems with this change - 
> [https://github.com/apache/spark/commit/0a441d2edb0a3f6c6c7c370db8917e1c07f211e7#diff-eeac5bdf3a1ecd7b9f8aaf10fff37f05R50]
>  in Spark 2.3 for Kafka Source Provider, a Kafka source can not be run in 
> micro-batch mode but only in continuous mode. Is that understanding correct ?
> {code:java}
> E Py4JJavaError: An error occurred while calling o217.load.
> E : org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:717)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:566)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:549)
> E at 
> org.apache.spark.sql.kafka010.SubscribeStrategy.createConsumer(ConsumerStrategy.scala:62)
> E at 
> org.apache.spark.sql.kafka010.KafkaOffsetReader.createConsumer(KafkaOffsetReader.scala:314)
> E at 
> org.apache.spark.sql.kafka010.KafkaOffsetReader.(KafkaOffsetReader.scala:78)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
> E at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
> E at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> E at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> E at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> E at java.lang.reflect.Method.invoke(Method.java:498)
> E at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> E at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> E at py4j.Gateway.invoke(Gateway.java:282)
> E at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> E at py4j.commands.CallCommand.execute(CallCommand.java:79)
> E at py4j.GatewayConnection.run(GatewayConnection.java:238)
> E at java.lang.Thread.run(Thread.java:748)
> E Caused by: org.apache.kafka.common.KafkaException: 
> org.apache.kafka.common.KafkaException: java.io.FileNotFoundException: 
> non-existent (No such file or directory)
> E at 
> org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:44)
> E at 
> org.apache.kafka.common.network.ChannelBuilders.create(ChannelBuilders.java:93)
> E at 
> org.apache.kafka.common.network.ChannelBuilders.clientChannelBuilder(ChannelBuilders.java:51)
> E at 
> org.apache.kafka.clients.ClientUtils.createChannelBuilder(ClientUtils.java:84)
> E at 
> org.apache.kafka.clients.consumer.KafkaConsumer.(KafkaConsumer.java:657)
> E ... 19 more
> E Caused by: org.apache.kafka.common.KafkaException: 
> java.io.FileNotFoundException: non-existent (No such file or directory)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:121)
> E at 
> org.apache.kafka.common.network.SslChannelBuilder.configure(SslChannelBuilder.java:41)
> E ... 23 more
> E Caused by: java.io.FileNotFoundException: non-existent (No such file or 
> directory)
> E at java.io.FileInputStream.open0(Native Method)
> E at java.io.FileInputStream.open(FileInputStream.java:195)
> E at java.io.FileInputStream.(FileInputStream.java:138)
> E at java.io.FileInputStream.(FileInputStream.java:93)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.load(SslFactory.java:216)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory$SecurityStore.access$000(SslFactory.java:201)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.createSSLContext(SslFactory.java:137)
> E at 
> org.apache.kafka.common.security.ssl.SslFactory.configure(SslFactory.java:119)
> E ... 24 more{code}
>  When running a simple data stream loader for kafka without an SSL cert, it 
> goes through this code block - 
>  
> {code:java}
> ...
> ...
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:130)
> E at 
> org.apache.spark.sql.kafka010.KafkaSourceProvider.createContinuousReader(KafkaSourceProvider.scala:43)
> E at 
> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:185)
> ...
> ...{code}
>  
> Note 

[jira] [Updated] (SPARK-27475) dev/deps/spark-deps-hadoop-3.2 is incorrect

2019-04-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27475:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-23710

> dev/deps/spark-deps-hadoop-3.2 is incorrect
> ---
>
> Key: SPARK-27475
> URL: https://issues.apache.org/jira/browse/SPARK-27475
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> parquet-hadoop-bundle-1.6.0.jar should be parquet-hadoop-bundle-1.8.1.jar.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27402) Fix hadoop-3.2 test issue(except the hive-thriftserver module)

2019-04-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27402:

Description: Fix sql/core and sql/hive modules test issue for hadoop-3.2

> Fix hadoop-3.2 test issue(except the hive-thriftserver module)
> --
>
> Key: SPARK-27402
> URL: https://issues.apache.org/jira/browse/SPARK-27402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Fix sql/core and sql/hive modules test issue for hadoop-3.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27402) Fix hadoop-3.2 test issue(except the hive-thriftserver module)

2019-04-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27402:

Description: (was: When we upgrade the built-in Hive to 2.3.4, the 
default spark.sql.hive.metastore.version should be 2.3.4. This will not be 
compatible with spark-2.3.3-bin-hadoop2.7.tgz and 
spark-2.4.1-bin-hadoop2.7.tgz.)

> Fix hadoop-3.2 test issue(except the hive-thriftserver module)
> --
>
> Key: SPARK-27402
> URL: https://issues.apache.org/jira/browse/SPARK-27402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27402) Fix hadoop-3.2 test issue(except the hive-thriftserver module)

2019-04-17 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27402:

Summary: Fix hadoop-3.2 test issue(except the hive-thriftserver module)  
(was: Support HiveExternalCatalog backward compatibility test)

> Fix hadoop-3.2 test issue(except the hive-thriftserver module)
> --
>
> Key: SPARK-27402
> URL: https://issues.apache.org/jira/browse/SPARK-27402
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> When we upgrade the built-in Hive to 2.3.4, the default 
> spark.sql.hive.metastore.version should be 2.3.4. This will not be compatible 
> with spark-2.3.3-bin-hadoop2.7.tgz and spark-2.4.1-bin-hadoop2.7.tgz.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25088) Rest Server default & doc updates

2019-04-17 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819764#comment-16819764
 ] 

t oo commented on SPARK-25088:
--

Why block REST if auth is on? For example, I want to be able to use 
unauthenticated REST AND authenticated standard submission.

> Rest Server default & doc updates
> -
>
> Key: SPARK-25088
> URL: https://issues.apache.org/jira/browse/SPARK-25088
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Spark Core
>Affects Versions: 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Major
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> The rest server could use some updates on defaults & docs, both in standalone 
> and mesos.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27486) Enable History server storage information test

2019-04-17 Thread shahid (JIRA)
shahid created SPARK-27486:
--

 Summary: Enable History server storage information test
 Key: SPARK-27486
 URL: https://issues.apache.org/jira/browse/SPARK-27486
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.1, 2.3.3, 3.0.0
Reporter: shahid


After SPARK-22050, we can store information about block updated events in the 
event log if we enable "spark.eventLog.logBlockUpdates.enabled=true". The test 
related to storage in the History server suite has been disabled since 
SPARK-13845. So we can re-enable that test by adding an event log for an 
application that ran with "spark.eventLog.logBlockUpdates.enabled=true".
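
A minimal sketch of the configuration needed to produce such an event log (the log
directory is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("block-update-event-log")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "/tmp/spark-events")
  .config("spark.eventLog.logBlockUpdates.enabled", "true")
  .getOrCreate()

// Caching produces block updated events, which now end up in the written event log.
spark.range(1000).cache().count()
{code}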



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27486) Enable History server storage information test

2019-04-17 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819759#comment-16819759
 ] 

shahid commented on SPARK-27486:


I will raise a PR

> Enable History server storage information test
> --
>
> Key: SPARK-27486
> URL: https://issues.apache.org/jira/browse/SPARK-27486
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.1, 3.0.0
>Reporter: shahid
>Priority: Minor
>
> After SPARK-22050, we can store information about block updated events in the 
> event log if we enable "spark.eventLog.logBlockUpdates.enabled=true". The test 
> related to storage in the History server suite has been disabled since 
> SPARK-13845. So we can re-enable that test by adding an event log for an 
> application that ran with "spark.eventLog.logBlockUpdates.enabled=true".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org