[jira] [Resolved] (SPARK-25736) add tests to verify the behavior of multi-column count

2018-10-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25736.
--
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22728
[https://github.com/apache/spark/pull/22728]
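
For context on what the new tests cover: in Spark SQL a multi-column count such as {{count(a, b)}} only counts rows in which none of its arguments is null. A minimal, self-contained sketch (Scala; the local session, column names and sample data are illustrative assumptions, not taken from the test suite):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("multi-column-count").getOrCreate()
import spark.implicits._

// Three rows: only the first has both columns non-null.
Seq(
  (Option(1), Option("a")),
  (Option(2), Option.empty[String]),
  (Option.empty[Int], Option("c"))
).toDF("a", "b").createOrReplaceTempView("t")

// Expected: count(a, b) = 1, count(a) = 2, count(*) = 3
spark.sql("SELECT count(a, b), count(a), count(*) FROM t").show()
{code}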

> add tests to verify the behavior of multi-column count
> --
>
> Key: SPARK-25736
> URL: https://issues.apache.org/jira/browse/SPARK-25736
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.0.0, 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25740) Set some configuration need invalidateStatsCache

2018-10-16 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25740:
---

 Summary: Set some configuration need invalidateStatsCache
 Key: SPARK-25740
 URL: https://issues.apache.org/jira/browse/SPARK-25740
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
explain select * from t1, t2 where t1.a = t2.a;
exit;

spark-sql
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin
exit;

spark-sql
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
{code}
We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do is 
call {{invalidateAllCachedTables}} when executing the SET command:
{code}
val isInvalidateAllCachedTablesKeys = Set(
  SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
  SQLConf.DEFAULT_SIZE_IN_BYTES.key
)
sparkSession.conf.set(key, value)
if (isInvalidateAllCachedTablesKeys.contains(key)) {
  sparkSession.sessionState.catalog.invalidateAllCachedTables()
}
{code}
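
A self-contained sketch (Scala) of the check proposed above, pulled out into a helper rather than patched into the real {{SetCommand}}. The object name is illustrative, and it is assumed to live in the {{org.apache.spark.sql}} package because {{sessionState}} is package-private:

{code:scala}
package org.apache.spark.sql

import org.apache.spark.sql.internal.SQLConf

object StatsAwareSet {
  // Keys whose value changes should invalidate cached table statistics.
  private val invalidateStatsKeys = Set(
    SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
    SQLConf.DEFAULT_SIZE_IN_BYTES.key
  )

  def set(sparkSession: SparkSession, key: String, value: String): Unit = {
    sparkSession.conf.set(key, value)
    if (invalidateStatsKeys.contains(key)) {
      // Drop cached relations so plan statistics are recomputed under the new config.
      sparkSession.sessionState.catalog.invalidateAllCachedTables()
    }
  }
}
{code}

The idea is that after such a SET, the second {{explain}} in the same spark-sql session above would pick up the HDFS-based sizes and plan a BroadcastHashJoin instead of a SortMergeJoin.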






[jira] [Assigned] (SPARK-25740) Set some configuration need invalidateStatsCache

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25740:


Assignee: Apache Spark

> Set some configuration need invalidateStatsCache
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> explain select * from t1, t2 where t1.a = t2.a;
> exit;
> spark-sql
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> exit;
> spark-sql
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> {code}
> We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do 
> is call {{invalidateAllCachedTables}} when executing the SET command:
> {code}
> val isInvalidateAllCachedTablesKeys = Set(
>   SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
>   SQLConf.DEFAULT_SIZE_IN_BYTES.key
> )
> sparkSession.conf.set(key, value)
> if (isInvalidateAllCachedTablesKeys.contains(key)) {
>   sparkSession.sessionState.catalog.invalidateAllCachedTables()
> }
> {code}






[jira] [Commented] (SPARK-25740) Set some configuration need invalidateStatsCache

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651264#comment-16651264
 ] 

Apache Spark commented on SPARK-25740:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22743

> Set some configuration need invalidateStatsCache
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> explain select * from t1, t2 where t1.a = t2.a;
> exit;
> spark-sql
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> exit;
> spark-sql
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> {code}
> We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do 
> is call {{invalidateAllCachedTables}} when executing the SET command:
> {code}
> val isInvalidateAllCachedTablesKeys = Set(
>   SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
>   SQLConf.DEFAULT_SIZE_IN_BYTES.key
> )
> sparkSession.conf.set(key, value)
> if (isInvalidateAllCachedTablesKeys.contains(key)) {
>   sparkSession.sessionState.catalog.invalidateAllCachedTables()
> }
> {code}






[jira] [Assigned] (SPARK-25740) Set some configuration need invalidateStatsCache

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25740:


Assignee: (was: Apache Spark)

> Set some configuration need invalidateStatsCache
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> explain select * from t1, t2 where t1.a = t2.a;
> exit;
> spark-sql
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> exit;
> spark-sql
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> {code}
> We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do 
> is call {{invalidateAllCachedTables}} when executing the SET command:
> {code}
> val isInvalidateAllCachedTablesKeys = Set(
>   SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
>   SQLConf.DEFAULT_SIZE_IN_BYTES.key
> )
> sparkSession.conf.set(key, value)
> if (isInvalidateAllCachedTablesKeys.contains(key)) {
>   sparkSession.sessionState.catalog.invalidateAllCachedTables()
> }
> {code}






[jira] [Updated] (SPARK-25740) Set some configuration need invalidateStatsCache

2018-10-16 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25740:

Description: 
How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
explain select * from t1, t2 where t1.a = t2.a;
exit;

spark-sql
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
exit;

# spark-sql
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin
exit;
{code}
We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do is 
call {{invalidateAllCachedTables}} when executing the SET command:
{code:java}
val isInvalidateAllCachedTablesKeys = Set(
  SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
  SQLConf.DEFAULT_SIZE_IN_BYTES.key
)
sparkSession.conf.set(key, value)
if (isInvalidateAllCachedTablesKeys.contains(key)) {
  sparkSession.sessionState.catalog.invalidateAllCachedTables()
}
{code}

  was:
How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
explain select * from t1, t2 where t1.a = t2.a;
exit;

spark-sql
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin
exit;

spark-sql
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
{code}
We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do is 
call {{invalidateAllCachedTables}} when executing the SET command:
{code}
val isInvalidateAllCachedTablesKeys = Set(
  SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
  SQLConf.DEFAULT_SIZE_IN_BYTES.key
)
sparkSession.conf.set(key, value)
if (isInvalidateAllCachedTablesKeys.contains(key)) {
  sparkSession.sessionState.catalog.invalidateAllCachedTables()
}
{code}


> Set some configuration need invalidateStatsCache
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> explain select * from t1, t2 where t1.a = t2.a;
> exit;
> spark-sql
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> exit;
> # spark-sql
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> exit;
> {code}
> We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do 
> is call {{invalidateAllCachedTables}} when executing the SET command:
> {code:java}
> val isInvalidateAllCachedTablesKeys = Set(
>   SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
>   SQLConf.DEFAULT_SIZE_IN_BYTES.key
> )
> sparkSession.conf.set(key, value)
> if (isInvalidateAllCachedTablesKeys.contains(key)) {
>   sparkSession.sessionState.catalog.invalidateAllCachedTables()
> }
> {code}






[jira] [Updated] (SPARK-25740) Set some configuration need invalidateStatsCache

2018-10-16 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25740:

Description: 
How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
exit;

spark-sql
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
exit;

spark-sql
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin
exit;
{code}
We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do is 
call {{invalidateAllCachedTables}} when executing the SET command:
{code:java}
val isInvalidateAllCachedTablesKeys = Set(
  SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
  SQLConf.DEFAULT_SIZE_IN_BYTES.key
)
sparkSession.conf.set(key, value)
if (isInvalidateAllCachedTablesKeys.contains(key)) {
  sparkSession.sessionState.catalog.invalidateAllCachedTables()
}
{code}

  was:
How to reproduce:
{code:sql}
# spark-sql
create table t1 (a int) stored as parquet;
create table t2 (a int) stored as parquet;
insert into table t1 values (1);
insert into table t2 values (1);
explain select * from t1, t2 where t1.a = t2.a;
exit;

spark-sql
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- BroadcastHashJoin
exit;

# spark-sql
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin
set spark.sql.statistics.fallBackToHdfs=true;
explain select * from t1, t2 where t1.a = t2.a;
-- SortMergeJoin, it should be BroadcastHashJoin
exit;
{code}
We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do is 
call {{invalidateAllCachedTables}} when executing the SET command:
{code:java}
val isInvalidateAllCachedTablesKeys = Set(
  SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
  SQLConf.DEFAULT_SIZE_IN_BYTES.key
)
sparkSession.conf.set(key, value)
if (isInvalidateAllCachedTablesKeys.contains(key)) {
  sparkSession.sessionState.catalog.invalidateAllCachedTables()
}
{code}


> Set some configuration need invalidateStatsCache
> 
>
> Key: SPARK-25740
> URL: https://issues.apache.org/jira/browse/SPARK-25740
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> # spark-sql
> create table t1 (a int) stored as parquet;
> create table t2 (a int) stored as parquet;
> insert into table t1 values (1);
> insert into table t2 values (1);
> exit;
> spark-sql
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- BroadcastHashJoin
> exit;
> spark-sql
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin
> set spark.sql.statistics.fallBackToHdfs=true;
> explain select * from t1, t2 where t1.a = t2.a;
> -- SortMergeJoin, it should be BroadcastHashJoin
> exit;
> {code}
> We need {{LogicalPlanStats.invalidateStatsCache}}, but it seems all we can do 
> is call {{invalidateAllCachedTables}} when executing the SET command:
> {code:java}
> val isInvalidateAllCachedTablesKeys = Set(
>   SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key,
>   SQLConf.DEFAULT_SIZE_IN_BYTES.key
> )
> sparkSession.conf.set(key, value)
> if (isInvalidateAllCachedTablesKeys.contains(key)) {
>   sparkSession.sessionState.catalog.invalidateAllCachedTables()
> }
> {code}






[jira] [Updated] (SPARK-22386) Data Source V2 improvements

2018-10-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22386:
-
Target Version/s: 3.0.0  (was: 2.5.0)

> Data Source V2 improvements
> ---
>
> Key: SPARK-22386
> URL: https://issues.apache.org/jira/browse/SPARK-22386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: releasenotes
>







[jira] [Commented] (SPARK-24882) data source v2 API improvement

2018-10-16 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651303#comment-16651303
 ] 

Hyukjin Kwon commented on SPARK-24882:
--

[~cloud_fan], should we put this JIRA under SPARK-22386?

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better naming 
> for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Mridul Muralidharan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651301#comment-16651301
 ] 

Mridul Muralidharan commented on SPARK-25732:
-

[~vanzin] With long-running applications (not necessarily streaming) needing 
access (read/write) to various data sources (not just HDFS), is there a way to 
do this even assuming the Livy RPC was augmented to support it? For example, the 
Livy server would not know which data sources to fetch tokens for (since that 
will be part of the user application jars/config).

For the specific use case [~mgaido] detailed, the proxy principal ('foo') and keytab would 
be present and distinct from the Zeppelin or Livy principal/keytab.
The 'proxy' part would simply be for Livy to submit the application as the 
proxied user 'foo'; once the application comes up, it will behave as though it was 
submitted by the user 'foo' with the specified keytab (from HDFS), acquiring/renewing 
tokens for user 'foo' from its keytab.


[~tgraves] I do share your concern; unfortunately, for the use case Marco is 
targeting, there does not seem to be an alternative; the Livy server is the man 
in the middle here (w.r.t. the submitting client).

Having said that, if there is an alternative, I would definitely prefer that 
over sharing keytabs, even if it is over secured HDFS.


> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to the 
> lack of token renewal. In order to enable it, we need the 
> keytab/principal of the impersonated user to be specified, in order to have 
> them available for the token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, with the latter allowing a keytab to also be specified 
> on a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.






[jira] [Updated] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`

2018-10-16 Thread liuxian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuxian updated SPARK-25729:

Description: 
In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`,

it is better to replace `minPartitions` with `defaultParallelism`, because 
this can make better use of resources and improve parallelism.

  was:
In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`,

it is better to replace `minPartitions` with `defaultParallelism`, because 
this can make better use of resources and improve concurrency.


> It is better to replace `minPartitions` with `defaultParallelism` , when 
> `minPartitions` is less than `defaultParallelism`
> --
>
> Key: SPARK-25729
> URL: https://issues.apache.org/jira/browse/SPARK-25729
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> In `WholeTextFileRDD`, when `minPartitions` is less than `defaultParallelism`,
> it is better to replace `minPartitions` with `defaultParallelism`, because 
> this can make better use of resources and improve parallelism.
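
A minimal sketch (Scala) of the proposed behavior, expressed against the public {{wholeTextFiles}} API rather than {{WholeTextFileRDD}} itself; the helper name is an illustrative assumption:

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def wholeTextFilesWithDefaultParallelism(
    sc: SparkContext,
    path: String,
    minPartitions: Int): RDD[(String, String)] = {
  // If the caller asks for fewer partitions than defaultParallelism,
  // bump it up so the available cores are actually used.
  val effectiveMinPartitions = math.max(minPartitions, sc.defaultParallelism)
  sc.wholeTextFiles(path, effectiveMinPartitions)
}
{code}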






[jira] [Commented] (SPARK-10816) EventTime based sessionization

2018-10-16 Thread Jungtaek Lim (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651404#comment-16651404
 ] 

Jungtaek Lim commented on SPARK-10816:
--

Update: I've crafted another performance test for testing the same query with each data 
pattern.

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816]

 

I've separated the packages for the two data patterns just for simplicity. The class names 
are the same.

Data pattern 1: plenty of rows in same session

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_rows_in_session]

Data pattern 2: plenty of sessions

[https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_sessions]

 

While running the benchmark with data pattern 2, I found some performance hits 
in my patch, so I made some fixes as well. Most of the fixes were about reducing the 
amount of codegen, but there's also a major fix: pre-merging sessions in a 
local partition is now optional, because it seriously harms the performance with data 
pattern 2.

The patch's state handling is still sub-optimal. I guess it is now the major 
bottleneck in my patch, so I'm wrapping my head around finding good alternatives. Baidu's 
list state would be one of them, since I realized \[3] might put in more deltas as 
well as require more operations.

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Priority: Major
> Attachments: SPARK-10816 Support session window natively.pdf, Session 
> Window Support For Structure Streaming.pdf
>
>







[jira] [Commented] (SPARK-18127) Add hooks and extension points to Spark

2018-10-16 Thread Ayush Nigam (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651427#comment-16651427
 ] 

Ayush Nigam commented on SPARK-18127:
-

Is there any documentation on how to use Hooks for Spark?

> Add hooks and extension points to Spark
> ---
>
> Key: SPARK-18127
> URL: https://issues.apache.org/jira/browse/SPARK-18127
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Srinath
>Assignee: Sameer Agarwal
>Priority: Major
> Fix For: 2.2.0
>
>
> As a Spark user I want to be able to customize my spark session. I currently 
> want to be able to do the following things:
> # I want to be able to add custom analyzer rules. This allows me to implement 
> my own logical constructs; an example of this could be a recursive operator.
> # I want to be able to add my own analysis checks. This allows me to catch 
> problems with spark plans early on. An example of this can be some datasource 
> specific checks.
> # I want to be able to add my own optimizations. This allows me to optimize 
> plans in different ways, for instance when you use a very different cluster 
> (for example a one-node X1 instance). This supersedes the current 
> {{spark.experimental}} methods
> # I want to be able to add my own planning strategies. This supersedes the 
> current {{spark.experimental}} methods. This allows me to plan my own 
> physical plan; an example of this would be to plan my own heavily integrated 
> data source (CarbonData for example).
> # I want to be able to use my own customized SQL constructs. An example of 
> this would be supporting my own dialect, or being able to add constructs to the 
> current SQL language. I should not have to implement a complete parser, and 
> should be able to delegate to an underlying parser.
> # I want to be able to track modifications and calls to the external catalog. 
> I want this API to be stable. This allows me to synchronize with other 
> systems.
> This API should modify the SparkSession when the session gets started, and it 
> should NOT change the session in flight.
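
Regarding the documentation question above, a minimal sketch (Scala) of the {{SparkSessionExtensions}} API that came out of this work (Spark 2.2+); the rule below is a no-op placeholder and the names are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A custom optimizer rule; a real rule would rewrite the plan instead of returning it as-is.
case class MyNoopOptimizerRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions { extensions =>
    // Other injection points include injectResolutionRule, injectCheckRule,
    // injectPlannerStrategy and injectParser.
    extensions.injectOptimizerRule(session => MyNoopOptimizerRule(session))
  }
  .getOrCreate()
{code}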






[jira] [Updated] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-16 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-25741:
---
Summary: Long URLs are not rendered properly in web UI  (was: Long URL are 
not rendered properly in web UI)

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When the URL in the description column of the job/stage page table is long, 
> the Web UI doesn't render it properly.






[jira] [Created] (SPARK-25741) Long URL are not rendered properly in web UI

2018-10-16 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25741:
--

 Summary: Long URL are not rendered properly in web UI
 Key: SPARK-25741
 URL: https://issues.apache.org/jira/browse/SPARK-25741
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.4.0
Reporter: Gengliang Wang


When the URL in the description column of the job/stage page table is long, 
the Web UI doesn't render it properly.







[jira] [Assigned] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25741:


Assignee: Apache Spark

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Minor
>
> When the URL in the description column of the job/stage page table is long, 
> the Web UI doesn't render it properly.






[jira] [Commented] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651438#comment-16651438
 ] 

Apache Spark commented on SPARK-25741:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22744

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When the URL in the description column of the job/stage page table is long, 
> the Web UI doesn't render it properly.






[jira] [Commented] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651437#comment-16651437
 ] 

Apache Spark commented on SPARK-25741:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22744

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When the URL in the description column of the job/stage page table is long, 
> the Web UI doesn't render it properly.






[jira] [Assigned] (SPARK-25741) Long URLs are not rendered properly in web UI

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25741:


Assignee: (was: Apache Spark)

> Long URLs are not rendered properly in web UI
> -
>
> Key: SPARK-25741
> URL: https://issues.apache.org/jira/browse/SPARK-25741
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When the URL in the description column of the job/stage page table is long, 
> the Web UI doesn't render it properly.






[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651480#comment-16651480
 ] 

Marco Gaido commented on SPARK-25728:
-

Thanks [~kiszk]. I will check it ASAP, thanks for your work on this.

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using the DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.






[jira] [Created] (SPARK-25742) Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container?

2018-10-16 Thread Oscar Bonilla (JIRA)
Oscar Bonilla created SPARK-25742:
-

 Summary: Is there a way to pass the Azure blob storage credentials 
to the spark for k8s init-container?
 Key: SPARK-25742
 URL: https://issues.apache.org/jira/browse/SPARK-25742
 Project: Spark
  Issue Type: Question
  Components: Kubernetes
Affects Versions: 2.3.2
Reporter: Oscar Bonilla


I'm trying to run spark on a kubernetes cluster in Azure. The idea is to store 
the Spark application jars and dependencies in a container in Azure Blob 
Storage.

I've tried to do this with a public container and it works OK, but with a 
private Blob Storage container, the spark-init init-container doesn't 
download the jars.

The equivalent in AWS S3 is as simple as adding the key_id and secret as 
environment variables, but I don't see how to do this for Azure Blob Storage.
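
A hedged sketch (Scala), not a confirmed answer: blob-store credentials are normally passed to Spark as Hadoop configuration through the {{spark.hadoop.*}} prefix. The property names below are the standard hadoop-aws / hadoop-azure (wasb) ones, the environment variable names are assumptions, and whether the 2.3 spark-init init-container honours these settings when downloading jars is exactly the open question of this ticket:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed environment variables holding the credentials.
val azureAccount = sys.env("AZURE_STORAGE_ACCOUNT")
val azureKey = sys.env("AZURE_STORAGE_KEY")

val spark = SparkSession.builder()
  // AWS S3 (s3a://) equivalent of the key_id/secret environment variables mentioned above.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  // Azure Blob Storage (wasb://) account key for the storage account.
  .config(s"spark.hadoop.fs.azure.account.key.$azureAccount.blob.core.windows.net", azureKey)
  .getOrCreate()
{code}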






[jira] [Created] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors

2018-10-16 Thread neenu (JIRA)
neenu created SPARK-25743:
-

 Summary: New executors are not launched for kubernetes spark 
thrift on deleting existing executors 
 Key: SPARK-25743
 URL: https://issues.apache.org/jira/browse/SPARK-25743
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.2.0
 Environment: Physical lab configurations.

8 baremetal servers, 
Each 56 Cores, 384GB RAM, RHEL 7.4
Kernel : 3.10.0-862.9.1.el7.x86_64
redhat-release-server.x86_64 7.4-18.el7

 

 

Kubernetes info:

Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", 
Platform:"linux/amd64"}
Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", 
Platform:"linux/amd64"}
Reporter: neenu


Launched spark thrift in kubernetes cluster with dynamic allocation enabled.

Configurations set : 

spark.executor.memory=35g
spark.executor.cores=8
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=10
spark.dynamicAllocation.cachedExecutorIdleTimeout=15
spark.driver.memory=10g
spark.driver.cores=4
spark.sql.crossJoin.enabled=true
spark.sql.starJoinOptimization=true
spark.sql.codegen=true
spark.rpc.numRetries=5
spark.rpc.retry.wait=5
spark.sql.broadcastTimeout=1200
spark.network.timeout=1800
spark.dynamicAllocation.maxExecutors=15
spark.kubernetes.allocation.batch.size=2
spark.kubernetes.allocation.batch.delay=9
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kubernetes.node.selector.is_control=false

Tried to run TPC-DS queries on 1TB of Parquet (Snappy) data.

Found that as the execution progressed, the tasks were done by a single executor 
(executor 53) and no new executors were getting spawned, even though there were 
enough resources to spawn more executors.

Tried to manually delete executor pod 53 and saw that no new executor was 
spawned to replace the one that was running.

Attached the driver log (driver.log).






[jira] [Updated] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors

2018-10-16 Thread neenu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neenu updated SPARK-25743:
--
Attachment: query_0_correct.sql

> New executors are not launched for kubernetes spark thrift on deleting 
> existing executors 
> --
>
> Key: SPARK-25743
> URL: https://issues.apache.org/jira/browse/SPARK-25743
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.2.0
> Environment: Physical lab configurations.
> 8 baremetal servers, 
> Each 56 Cores, 384GB RAM, RHEL 7.4
> Kernel : 3.10.0-862.9.1.el7.x86_64
> redhat-release-server.x86_64 7.4-18.el7
>  
>  
> Kubernetes info:
> Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
> GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
> BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
> GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
> BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", 
> Platform:"linux/amd64"}
>Reporter: neenu
>Priority: Major
> Attachments: driver.log, query_0_correct.sql
>
>
> Launched spark thrift in kubernetes cluster with dynamic allocation enabled.
> Configurations set : 
> spark.executor.memory=35g
> spark.executor.cores=8
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.executorIdleTimeout=10
> spark.dynamicAllocation.cachedExecutorIdleTimeout=15
> spark.driver.memory=10g
> spark.driver.cores=4
> spark.sql.crossJoin.enabled=true
> spark.sql.starJoinOptimization=true
> spark.sql.codegen=true
> spark.rpc.numRetries=5
> spark.rpc.retry.wait=5
> spark.sql.broadcastTimeout=1200
> spark.network.timeout=1800
> spark.dynamicAllocation.maxExecutors=15
> spark.kubernetes.allocation.batch.size=2
> spark.kubernetes.allocation.batch.delay=9
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.kubernetes.node.selector.is_control=false
> Tried to run TPC-DS queries on 1TB of Parquet (Snappy) data.
> Found that as the execution progressed, the tasks were done by a single executor 
> (executor 53) and no new executors were getting spawned, even though there 
> were enough resources to spawn more executors.
>  
> Tried to manually delete executor pod 53 and saw that no new executor was 
> spawned to replace the one that was running.
> Attached the driver log (driver.log).






[jira] [Commented] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors

2018-10-16 Thread neenu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651514#comment-16651514
 ] 

neenu commented on SPARK-25743:
---

Also attached the tpcds query list executed [^query_0_correct.sql]

> New executors are not launched for kubernetes spark thrift on deleting 
> existing executors 
> --
>
> Key: SPARK-25743
> URL: https://issues.apache.org/jira/browse/SPARK-25743
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.2.0
> Environment: Physical lab configurations.
> 8 baremetal servers, 
> Each 56 Cores, 384GB RAM, RHEL 7.4
> Kernel : 3.10.0-862.9.1.el7.x86_64
> redhat-release-server.x86_64 7.4-18.el7
>  
>  
> Kubernetes info:
> Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
> GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
> BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
> GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
> BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", 
> Platform:"linux/amd64"}
>Reporter: neenu
>Priority: Major
> Attachments: driver.log
>
>
> Launched spark thrift in kubernetes cluster with dynamic allocation enabled.
> Configurations set : 
> spark.executor.memory=35g
> spark.executor.cores=8
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.executorIdleTimeout=10
> spark.dynamicAllocation.cachedExecutorIdleTimeout=15
> spark.driver.memory=10g
> spark.driver.cores=4
> spark.sql.crossJoin.enabled=true
> spark.sql.starJoinOptimization=true
> spark.sql.codegen=true
> spark.rpc.numRetries=5
> spark.rpc.retry.wait=5
> spark.sql.broadcastTimeout=1200
> spark.network.timeout=1800
> spark.dynamicAllocation.maxExecutors=15
> spark.kubernetes.allocation.batch.size=2
> spark.kubernetes.allocation.batch.delay=9
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.kubernetes.node.selector.is_control=false
> Tried to run TPC-DS queries on 1TB of Parquet (Snappy) data.
> Found that as the execution progressed, the tasks were done by a single executor 
> (executor 53) and no new executors were getting spawned, even though there 
> were enough resources to spawn more executors.
>  
> Tried to manually delete executor pod 53 and saw that no new executor was 
> spawned to replace the one that was running.
> Attached the driver log (driver.log).






[jira] [Updated] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors

2018-10-16 Thread neenu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

neenu updated SPARK-25743:
--
Attachment: driver.log

> New executors are not launched for kubernetes spark thrift on deleting 
> existing executors 
> --
>
> Key: SPARK-25743
> URL: https://issues.apache.org/jira/browse/SPARK-25743
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.2.0
> Environment: Physical lab configurations.
> 8 baremetal servers, 
> Each 56 Cores, 384GB RAM, RHEL 7.4
> Kernel : 3.10.0-862.9.1.el7.x86_64
> redhat-release-server.x86_64 7.4-18.el7
>  
>  
> Kubernetes info:
> Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
> GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
> BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", 
> GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", 
> BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", 
> Platform:"linux/amd64"}
>Reporter: neenu
>Priority: Major
> Attachments: driver.log, query_0_correct.sql
>
>
> Launched spark thrift in kubernetes cluster with dynamic allocation enabled.
> Configurations set : 
> spark.executor.memory=35g
> spark.executor.cores=8
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.executorIdleTimeout=10
> spark.dynamicAllocation.cachedExecutorIdleTimeout=15
> spark.driver.memory=10g
> spark.driver.cores=4
> spark.sql.crossJoin.enabled=true
> spark.sql.starJoinOptimization=true
> spark.sql.codegen=true
> spark.rpc.numRetries=5
> spark.rpc.retry.wait=5
> spark.sql.broadcastTimeout=1200
> spark.network.timeout=1800
> spark.dynamicAllocation.maxExecutors=15
> spark.kubernetes.allocation.batch.size=2
> spark.kubernetes.allocation.batch.delay=9
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> spark.kubernetes.node.selector.is_control=false
> Tried to run TPC-DS queries on 1TB of Parquet (Snappy) data.
> Found that as the execution progressed, the tasks were done by a single executor 
> (executor 53) and no new executors were getting spawned, even though there 
> were enough resources to spawn more executors.
>  
> Tried to manually delete executor pod 53 and saw that no new executor was 
> spawned to replace the one that was running.
> Attached the driver log (driver.log).






[jira] [Created] (SPARK-25744) Allow kubernetes integration tests to be run against a real cluster.

2018-10-16 Thread Prashant Sharma (JIRA)
Prashant Sharma created SPARK-25744:
---

 Summary: Allow kubernetes integration tests to be run against a 
real cluster.
 Key: SPARK-25744
 URL: https://issues.apache.org/jira/browse/SPARK-25744
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Prashant Sharma


Currently, the tests can only run against a Minikube cluster. Testing against a 
real cluster gives more flexibility in writing tests with a larger number of 
executors and more resources.

It will also be helpful if Minikube is unavailable for testing.







[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651546#comment-16651546
 ] 

Apache Spark commented on SPARK-21402:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22745

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> +--------------------------+------------------------+
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |                1498854000|              1498870800|
> +--------------------------+------------------------+
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.






[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651550#comment-16651550
 ] 

Apache Spark commented on SPARK-21402:
--

User 'vofque' has created a pull request for this issue:
https://github.com/apache/spark/pull/22745

> Java encoders - switch fields on collectAsList
> --
>
> Key: SPARK-21402
> URL: https://issues.apache.org/jira/browse/SPARK-21402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: mac os
> spark 2.1.1
> Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121
>Reporter: Tom
>Priority: Major
>
> I have the following schema in a dataset -
> root
>  |-- userId: string (nullable = true)
>  |-- data: map (nullable = true)
>  ||-- key: string
>  ||-- value: struct (valueContainsNull = true)
>  |||-- startTime: long (nullable = true)
>  |||-- endTime: long (nullable = true)
>  |-- offset: long (nullable = true)
>  And I have the following classes (+ setter and getters which I omitted for 
> simplicity) -
>  
> {code:java}
> public class MyClass {
> private String userId;
> private Map<String, MyDTO> data;
> private Long offset;
>  }
> public class MyDTO {
> private long startTime;
> private long endTime;
> }
> {code}
> I collect the result the following way - 
> {code:java}
> Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class);
> Dataset<MyClass> results = raw_df.as(myClassEncoder);
> List<MyClass> lst = results.collectAsList();
> {code}
> 
> I do several calculations to get the result I want, and the result is correct 
> all the way through before I collect it.
> This is the result for - 
> {code:java}
> results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false);
> {code}
> +--------------------------+------------------------+
> |data[2017-07-01].startTime|data[2017-07-01].endTime|
> +--------------------------+------------------------+
> |                1498854000|              1498870800|
> +--------------------------+------------------------+
> This is the result after collecting the results for - 
> {code:java}
> MyClass userData = results.collectAsList().get(0);
> MyDTO userDTO = userData.getData().get("2017-07-01");
> System.out.println("userDTO startTime: " + userDTO.getStartTime());
> System.out.println("userDTO endTime: " + userDTO.getEndTime());
> {code}
> --
> data startTime: 1498870800
> data endTime: 1498854000
> I tend to believe it is a spark issue. Would love any suggestions on how to 
> bypass it.






[jira] [Commented] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2018-10-16 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651565#comment-16651565
 ] 

Yuming Wang commented on SPARK-18832:
-

[~roadster11x] I can't reproduce.

> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
>Priority: Major
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a HiveUDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine. Calling it using the following commands:
> // Registering the function; this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR 
> '/root/spark_files/experiments-1.2.jar';}}
> // Thriftserver is able to look up the function with this command:
> {{DESCRIBE FUNCTION SampleUDTF;}}
> {quote}
> {noformat}
> Output: 
> +----------------------------------------------------------+
> | function_desc                                            |
> +----------------------------------------------------------+
> | Function: default.SampleUDTF                             |
> | Class: com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF |
> | Usage: N/A.                                              |
> +----------------------------------------------------------+
> {noformat}
> {quote}
> // Calling the function: 
> {{SELECT SampleUDTF('Paris');}}
> bq. Output of the above command: Error: 
> org.apache.spark.sql.AnalysisException: Undefined function: 'SampleUDTF'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 1 pos 7 (state=,code=0)
> I have also tried using a non-local (HDFS) jar, but I get the same error.
> My environment: HDP 2.5 with spark 2.0.0
> I have attached the class file for the UDTF I am using in testing this.






[jira] [Updated] (SPARK-24882) data source v2 API improvement

2018-10-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24882:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22386

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.4.0
>
>
> Data source V2 has been out for a while; see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better naming 
> for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing






[jira] [Updated] (SPARK-24882) data source v2 API improvement

2018-10-16 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-24882:

Fix Version/s: (was: 2.4.0)
   3.0.0

> data source v2 API improvement
> --
>
> Key: SPARK-24882
> URL: https://issues.apache.org/jira/browse/SPARK-24882
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Data source V2 is out for a while, see the SPIP 
> [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing].
>  We have already migrated most of the built-in streaming data sources to the 
> V2 API, and the file source migration is in progress. During the migration, 
> we found several problems and want to address them before we stabilize the V2 
> API.
> To solve these problems, we need to separate responsibilities in the data 
> source v2 API, isolate the stateful part of the API, and think of better 
> naming for some interfaces. For details, please see the attached Google doc: 
> https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24499:


Assignee: Apache Spark

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651598#comment-16651598
 ] 

Apache Spark commented on SPARK-24499:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/22746

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24499:


Assignee: (was: Apache Spark)

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25579) Use quoted attribute names if needed in pushed ORC predicates

2018-10-16 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25579.
--
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22597
[https://github.com/apache/spark/pull/22597]

> Use quoted attribute names if needed in pushed ORC predicates
> -
>
> Key: SPARK-25579
> URL: https://issues.apache.org/jira/browse/SPARK-25579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Blocker
> Fix For: 3.0.0, 2.4.0
>
>
> This issue aims to fix an ORC performance regression at Spark 2.4.0 RCs from 
> Spark 2.3.2. For column names with `.`, the pushed predicates are ignored.
> *Test Data*
> {code:java}
> scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
> scala> df.write.mode("overwrite").orc("/tmp/orc")
> {code}
> *Spark 2.3.2*
> {code:java}
> scala> spark.sql("set spark.sql.orc.impl=native")
> scala> spark.sql("set spark.sql.orc.filterPushdown=true")
> scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 
> 10").show)
> ++
> |col.with.dot|
> ++
> |   1|
> |   8|
> ++
> Time taken: 1486 ms
> scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 
> 10").show)
> ++
> |col.with.dot|
> ++
> |   1|
> |   8|
> ++
> Time taken: 163 ms
> {code}
> *Spark 2.4.0 RC2*
> {code:java}
> scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 
> 10").show)
> ++
> |col.with.dot|
> ++
> |   1|
> |   8|
> ++
> Time taken: 4087 ms
> scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 
> 10").show)
> ++
> |col.with.dot|
> ++
> |   1|
> |   8|
> ++
> Time taken: 1998 ms
> {code}
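
As a quick, hedged way to observe what the scan is doing for such a column name
(the path and column follow the example above; the output shape may vary by
version):

{code:java}
scala> spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").explain()
// Inspecting the plan shows whether Spark pushed the predicate down to the
// FileScan node (look for the PushedFilters entry); timing the query, as in
// the description, shows whether the ORC reader actually benefits from it.
{code}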



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25745) docker-image-tool.sh ignores errors from Docker

2018-10-16 Thread Rob Vesse (JIRA)
Rob Vesse created SPARK-25745:
-

 Summary: docker-image-tool.sh ignores errors from Docker
 Key: SPARK-25745
 URL: https://issues.apache.org/jira/browse/SPARK-25745
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Kubernetes
Affects Versions: 2.3.2, 2.3.1, 2.3.0
Reporter: Rob Vesse


In attempting to use the {{docker-image-tool.sh}} scripts to build some custom 
Dockerfiles I ran into issues with the scripts' interaction with Docker.  Most 
notably, if the Docker build/push fails, the script continues, blindly ignoring 
the errors.  This can either result in a complete failure to build or lead to 
subtle bugs where images are built against different base images than expected.

Additionally, while the Dockerfiles assume that Spark has first been built 
locally, the scripts fail to validate this, which they could easily do by 
checking the expected JARs location.  This can also lead to failed Docker 
builds that could easily be avoided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24902) Add integration tests for PVs

2018-10-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651621#comment-16651621
 ] 

Stavros Kontopoulos commented on SPARK-24902:
-

This is blocked by this issue: 
https://github.com/fabric8io/kubernetes-client/issues/1234

> Add integration tests for PVs
> -
>
> Key: SPARK-24902
> URL: https://issues.apache.org/jira/browse/SPARK-24902
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> PVs and hostpath support has been added recently 
> (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 
> We should have some integration tests based on local storage. 
> It is easy to add PVs to minikube and attach them to pods.
> We could target a known dir like /tmp for the PV path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24902) Add integration tests for PVs

2018-10-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651621#comment-16651621
 ] 

Stavros Kontopoulos edited comment on SPARK-24902 at 10/16/18 12:50 PM:


This is blocked by this issue: 
[https://github.com/fabric8io/kubernetes-client/issues/1234]

Target is to use: https://github.com/fabric8io/kubernetes-client/issues/1234


was (Author: skonto):
This is blocked by this issue: 
https://github.com/fabric8io/kubernetes-client/issues/1234

> Add integration tests for PVs
> -
>
> Key: SPARK-24902
> URL: https://issues.apache.org/jira/browse/SPARK-24902
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> PVs and hostpath support has been added recently 
> (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 
> We should have some integration tests based on local storage. 
> It is easy to add PVs to minikube and attach them to pods.
> We could target a known dir like /tmp for the PV path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24902) Add integration tests for PVs

2018-10-16 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651621#comment-16651621
 ] 

Stavros Kontopoulos edited comment on SPARK-24902 at 10/16/18 12:50 PM:


This is blocked by this issue: 
[https://github.com/fabric8io/kubernetes-client/issues/1234]

Target is to use: 
https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta


was (Author: skonto):
This is blocked by this issue: 
[https://github.com/fabric8io/kubernetes-client/issues/1234]

Target is to use: https://github.com/fabric8io/kubernetes-client/issues/1234

> Add integration tests for PVs
> -
>
> Key: SPARK-24902
> URL: https://issues.apache.org/jira/browse/SPARK-24902
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Priority: Minor
>
> PVs and hostpath support has been added recently 
> (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. 
> We should have some integration tests based on local storage. 
> It is easy to add PVs to minikube and attach them to pods.
> We could target a known dir like /tmp for the PV path.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL

2018-10-16 Thread Yuanjian Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651628#comment-16651628
 ] 

Yuanjian Li commented on SPARK-24499:
-

{quote}
Is this just about splitting up the docs?
{quote}
If I understand correctly, splitting is the first step.
{quote}
 let's break that down a little further into JIRAs.
{quote}
No problem, we'll do the follow-up work by creating sub-tasks for this JIRA.

> Documentation improvement of Spark core and SQL
> ---
>
> Key: SPARK-24499
> URL: https://issues.apache.org/jira/browse/SPARK-24499
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> The current documentation in Apache Spark lacks enough code examples and 
> tips. If needed, we should also split the page of 
> https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple 
> separate pages like what we did for 
> https://spark.apache.org/docs/latest/ml-guide.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25745) docker-image-tool.sh ignores errors from Docker

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651630#comment-16651630
 ] 

Apache Spark commented on SPARK-25745:
--

User 'rvesse' has created a pull request for this issue:
https://github.com/apache/spark/pull/22748

> docker-image-tool.sh ignores errors from Docker
> ---
>
> Key: SPARK-25745
> URL: https://issues.apache.org/jira/browse/SPARK-25745
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Rob Vesse
>Priority: Major
>
> In attempting to use the {{docker-image-tool.sh}} scripts to build some 
> custom Dockerfiles I ran into issues with the scripts' interaction with 
> Docker.  Most notably, if the Docker build/push fails, the script continues, 
> blindly ignoring the errors.  This can either result in a complete failure to 
> build or lead to subtle bugs where images are built against different base 
> images than expected.
> Additionally, while the Dockerfiles assume that Spark has first been built 
> locally, the scripts fail to validate this, which they could easily do by 
> checking the expected JARs location.  This can also lead to failed Docker 
> builds that could easily be avoided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25745) docker-image-tool.sh ignores errors from Docker

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25745:


Assignee: (was: Apache Spark)

> docker-image-tool.sh ignores errors from Docker
> ---
>
> Key: SPARK-25745
> URL: https://issues.apache.org/jira/browse/SPARK-25745
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Rob Vesse
>Priority: Major
>
> In attempting to use the {{docker-image-tool.sh}} scripts to build some 
> custom Dockerfiles I ran into issues with the scripts' interaction with 
> Docker.  Most notably, if the Docker build/push fails, the script continues, 
> blindly ignoring the errors.  This can either result in a complete failure to 
> build or lead to subtle bugs where images are built against different base 
> images than expected.
> Additionally, while the Dockerfiles assume that Spark has first been built 
> locally, the scripts fail to validate this, which they could easily do by 
> checking the expected JARs location.  This can also lead to failed Docker 
> builds that could easily be avoided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25745) docker-image-tool.sh ignores errors from Docker

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25745:


Assignee: Apache Spark

> docker-image-tool.sh ignores errors from Docker
> ---
>
> Key: SPARK-25745
> URL: https://issues.apache.org/jira/browse/SPARK-25745
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Rob Vesse
>Assignee: Apache Spark
>Priority: Major
>
> In attempting to use the {{docker-image-tool.sh}} scripts to build some 
> custom Dockerfiles I ran into issues with the scripts' interaction with 
> Docker.  Most notably, if the Docker build/push fails, the script continues, 
> blindly ignoring the errors.  This can either result in a complete failure to 
> build or lead to subtle bugs where images are built against different base 
> images than expected.
> Additionally, while the Dockerfiles assume that Spark has first been built 
> locally, the scripts fail to validate this, which they could easily do by 
> checking the expected JARs location.  This can also lead to failed Docker 
> builds that could easily be avoided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25745) docker-image-tool.sh ignores errors from Docker

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651631#comment-16651631
 ] 

Apache Spark commented on SPARK-25745:
--

User 'rvesse' has created a pull request for this issue:
https://github.com/apache/spark/pull/22748

> docker-image-tool.sh ignores errors from Docker
> ---
>
> Key: SPARK-25745
> URL: https://issues.apache.org/jira/browse/SPARK-25745
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Rob Vesse
>Priority: Major
>
> In attempting to use the {{docker-image-tool.sh}} scripts to build some 
> custom Dockerfiles I ran into issues with the scripts' interaction with 
> Docker.  Most notably, if the Docker build/push fails, the script continues, 
> blindly ignoring the errors.  This can either result in a complete failure to 
> build or lead to subtle bugs where images are built against different base 
> images than expected.
> Additionally, while the Dockerfiles assume that Spark has first been built 
> locally, the scripts fail to validate this, which they could easily do by 
> checking the expected JARs location.  This can also lead to failed Docker 
> builds that could easily be avoided.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651657#comment-16651657
 ] 

Thomas Graves commented on SPARK-25732:
---

So, like Marcelo mentioned, can't you re-use the keytab/principal option already 
there?  It might need to be slightly modified to pull from HDFS, but that is 
really what this is doing; it's just that Livy is submitting the job for you.  
Really the user could specify it when submitting the job as a conf (? I guess it 
depends on who is calling Livy; Jupyter for instance definitely could, as the 
user can pass configs).  I would prefer that over adding more configs.

There are lots of cases where things are in the middle of job submission: Livy, 
Oozie, other workflow managers.  I don't see that as a reason not to do tokens. 
Users should know they are submitting jobs (especially one that runs for 2 
weeks) and, until we have a good automated solution, they would have to set up 
cron or something else to push tokens before they expire.  I know the YARN 
folks were looking at options to help with this but I haven't synced with them 
lately; ideally there would be a way to push the tokens to the RM for it to 
continue to renew, so you would only have to do it before the max lifetime.  It's 
easy enough to write a script that lists the applications running for the user 
and pushes tokens to each of those, assuming we had a spark-submit option to 
push tokens.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25705) Remove Kafka 0.8 support in Spark 3.0.0

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25705:
-

Assignee: Sean Owen

> Remove Kafka 0.8 support in Spark 3.0.0
> ---
>
> Key: SPARK-25705
> URL: https://issues.apache.org/jira/browse/SPARK-25705
> Project: Spark
>  Issue Type: Task
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
>
> In the move to Spark 3.0, we're talking about removing support for old legacy 
> connectors and versions of them. Kafka is about 2-3 major versions (depending 
> on how you count it) on from Kafka 0.8, and should likely be retired.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25705) Remove Kafka 0.8 support in Spark 3.0.0

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25705.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22703
[https://github.com/apache/spark/pull/22703]

> Remove Kafka 0.8 support in Spark 3.0.0
> ---
>
> Key: SPARK-25705
> URL: https://issues.apache.org/jira/browse/SPARK-25705
> Project: Spark
>  Issue Type: Task
>  Components: Build, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 3.0.0
>
>
> In the move to Spark 3.0, we're talking about removing support for old legacy 
> connectors and versions of them. Kafka is about 2-3 major versions (depending 
> on how you count it) on from Kafka 0.8, and should likely be retired.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-16 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Flags: Important  (was: Patch,Important)

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming out as null.  
> Even then, I am passing the correct "emptyValue" option.  However, my empty 
> values are still coming out as `""` in the written file.
> 
> I have tested the provided code in Databricks runtime environments 5.0 and 
> 4.1, and it gives the expected output.  However, in Databricks runtimes 4.2 
> and 4.3 (which run Spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null

2018-10-16 Thread Brian Jones (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brian Jones updated SPARK-25739:

Flags: Patch,Important  (was: Important)

> Double quote coming in as empty value even when emptyValue set as null
> --
>
> Key: SPARK-25739
> URL: https://issues.apache.org/jira/browse/SPARK-25739
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
> Environment:  Databricks - 4.2 (includes Apache Spark 2.3.1, Scala 
> 2.11) 
>Reporter: Brian Jones
>Priority: Major
>
>  Example code - 
> {code:java}
> val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value")
> df
> .repartition(1)
> .write
> .mode("overwrite")
> .option("nullValue", null)
> .option("emptyValue", null)
> .option("delimiter",",")
> .option("quoteMode", "NONE")
> .option("escape","\\")
> .format("csv")
> .save("/tmp/nullcsv/")
> var out = dbutils.fs.ls("/tmp/nullcsv/")
> var file = out(out.size - 1)
> val x = dbutils.fs.head("/tmp/nullcsv/" + file.name)
> println(x)
> {code}
> Output - 
> {code:java}
> 1,""
> 3,hi
> 2,hello
> 4,
> {code}
> Expected output - 
> {code:java}
> 1,
> 3,hi
> 2,hello
> 4,
> {code}
>  
> [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe]
>  This commit is relevant to my issue.
> "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In 
> version 2.3 and earlier, empty strings are equal to `null` values and do not 
> reflect to any characters in saved CSV files."
> I am on Spark version 2.3.1, so empty strings should be coming out as null.  
> Even then, I am passing the correct "emptyValue" option.  However, my empty 
> values are still coming out as `""` in the written file.
> 
> I have tested the provided code in Databricks runtime environments 5.0 and 
> 4.1, and it gives the expected output.  However, in Databricks runtimes 4.2 
> and 4.3 (which run Spark 2.3.1) we get the incorrect output.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651800#comment-16651800
 ] 

Marco Gaido commented on SPARK-25732:
-

[~tgraves] I think they can be reused; the point is that it may be confusing 
that:
{code}
kinit -kt super.keytab su...@example.com
spark-submit --principal a...@example.com --keytab hdfs:///a.keytab 
--proxy-user a
{code}
runs with user {{super}} impersonating user {{a}}, while
{code}
kinit -kt super.keytab su...@example.com
spark-submit --principal a...@example.com --keytab hdfs:///a.keytab
{code}
runs with user {{a}}. So the reason why I was proposing different configs is 
for clarity of the end user.

I think the other point is that giving external systems the responsibility of 
pushing tokens can cause an indefinite number of issues, and it is going to be 
hard to understand where the responsibility lies. Centralizing the 
responsibility in Spark would allow all these intermediate systems to work 
properly without any issue. 


> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651827#comment-16651827
 ] 

Thomas Graves commented on SPARK-25732:
---

yeah, I understand the concern; we don't want to confuse the user if we can help 
it. The 2 commands above are the same except for the --proxy-user, and in my 
opinion I don't think it would be confusing to the user: you pass in the keytab 
and principal for the user whose credentials you want refreshed, which is user 
"a" in both cases.  Making sure the docs are clear should make it clear to the 
users.  I assume most users submitting via Livy don't realize they are using 
Livy and being launched as a proxy user.  So the user would just specify 
keytab/principal configs based on their own user.

 

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651840#comment-16651840
 ] 

Thomas Graves commented on SPARK-25732:
---

sorry just realized I misread the second one.  why would you run the second 
command?

I would actually expect that to fail or run as the super user unless it 
downloaded the keytab and kinit'd on submission before it did anything with 
hdfs, etc.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651840#comment-16651840
 ] 

Thomas Graves edited comment on SPARK-25732 at 10/16/18 2:49 PM:
-

sorry just realized I misread the second one, though it was kinit as user a.  
why would you run the second command?

I would actually expect that to fail or run as the super user unless it 
downloaded the keytab and kinit'd on submission before it did anything with 
hdfs, etc.


was (Author: tgraves):
sorry just realized I misread the second one.  why would you run the second 
command?

I would actually expect that to fail or run as the super user unless it 
downloaded the keytab and kinit'd on submission before it did anything with 
hdfs, etc.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651840#comment-16651840
 ] 

Thomas Graves edited comment on SPARK-25732 at 10/16/18 2:53 PM:
-

sorry just realized I misread the second one, thought it was kinit as user a.  
why would you run the second command?

I would actually expect that to fail or run as the super user unless it 
downloaded the keytab and kinit'd on submission before it did anything with 
hdfs, etc.

I guess that is the confusion you were referring to and can see that but it 
seems like an odd use case to me.  Is something submitting this way now?  It 
almost seems like something we should disallow.


was (Author: tgraves):
sorry just realized I misread the second one, though it was kinit as user a.  
why would you run the second command?

I would actually expect that to fail or run as the super user unless it 
downloaded the keytab and kinit'd on submission before it did anything with 
hdfs, etc.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651860#comment-16651860
 ] 

Marco Gaido commented on SPARK-25732:
-

[~tgraves] yes, that is exactly what I am referring to as confusing. The point 
is, currently, if we specify --principal and --keytab they are used to log in 
(please see 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L348).
 So those are the credentials used (regardless of if/what you are kinited as).

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25746) Refactoring ExpressionEncoder

2018-10-16 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-25746:
---

 Summary: Refactoring ExpressionEncoder
 Key: SPARK-25746
 URL: https://issues.apache.org/jira/browse/SPARK-25746
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


This was inspired while implementing SPARK-24762. For now, ScalaReflection needs 
to consider how ExpressionEncoder uses the generated serializers and 
deserializers, and ExpressionEncoder has a weird flat flag. After discussion 
with [~cloud_fan], it seems better to refactor ExpressionEncoder. That should 
make SPARK-24762 easier to do.
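
For context, a minimal sketch of the pieces mentioned above, assuming a Spark
2.x spark-shell (the member names are as of 2.x and may change with the proposed
refactoring):

{code:java}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Encoder for a tuple: serializer/deserializer expressions are generated via
// ScalaReflection and map the object to/from multiple top-level columns.
val tupleEnc = ExpressionEncoder[(Int, String)]()
tupleEnc.serializer    // Seq[Expression] writing a (Int, String) into an InternalRow
tupleEnc.deserializer  // Expression reading a (Int, String) back out of an InternalRow
tupleEnc.flat          // false: the object spans several columns

// Encoder for a primitive sets the "flat" flag discussed above.
val intEnc = ExpressionEncoder[Int]()
intEnc.flat            // true: the whole object is a single column
{code}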



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25746) Refactoring ExpressionEncoder

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25746:


Assignee: (was: Apache Spark)

> Refactoring ExpressionEncoder
> -
>
> Key: SPARK-25746
> URL: https://issues.apache.org/jira/browse/SPARK-25746
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> This was inspired while implementing SPARK-24762. For now, ScalaReflection 
> needs to consider how ExpressionEncoder uses the generated serializers and 
> deserializers, and ExpressionEncoder has a weird flat flag. After discussion 
> with [~cloud_fan], it seems better to refactor ExpressionEncoder. That 
> should make SPARK-24762 easier to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25746) Refactoring ExpressionEncoder

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651927#comment-16651927
 ] 

Apache Spark commented on SPARK-25746:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/22749

> Refactoring ExpressionEncoder
> -
>
> Key: SPARK-25746
> URL: https://issues.apache.org/jira/browse/SPARK-25746
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> This was inspired while implementing SPARK-24762. For now, ScalaReflection 
> needs to consider how ExpressionEncoder uses the generated serializers and 
> deserializers, and ExpressionEncoder has a weird flat flag. After discussion 
> with [~cloud_fan], it seems better to refactor ExpressionEncoder. That 
> should make SPARK-24762 easier to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25746) Refactoring ExpressionEncoder

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25746:


Assignee: Apache Spark

> Refactoring ExpressionEncoder
> -
>
> Key: SPARK-25746
> URL: https://issues.apache.org/jira/browse/SPARK-25746
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> This was inspired while implementing SPARK-24762. For now, ScalaReflection 
> needs to consider how ExpressionEncoder uses the generated serializers and 
> deserializers, and ExpressionEncoder has a weird flat flag. After discussion 
> with [~cloud_fan], it seems better to refactor ExpressionEncoder. That 
> should make SPARK-24762 easier to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion

2018-10-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-25747:
---

 Summary: remove ColumnarBatchScan.needsUnsafeRowConversion
 Key: SPARK-25747
 URL: https://issues.apache.org/jira/browse/SPARK-25747
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25747:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove ColumnarBatchScan.needsUnsafeRowConversion
> -
>
> Key: SPARK-25747
> URL: https://issues.apache.org/jira/browse/SPARK-25747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652007#comment-16652007
 ] 

Apache Spark commented on SPARK-25747:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/22750

> remove ColumnarBatchScan.needsUnsafeRowConversion
> -
>
> Key: SPARK-25747
> URL: https://issues.apache.org/jira/browse/SPARK-25747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25747:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove ColumnarBatchScan.needsUnsafeRowConversion
> -
>
> Key: SPARK-25747
> URL: https://issues.apache.org/jira/browse/SPARK-25747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal

2018-10-16 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652029#comment-16652029
 ] 

Marcelo Vanzin commented on SPARK-25732:


bq. livy server would not know which data sources to fetch tokens for 

That is true. Two approaches that currently exist are Spark's (get tokens for 
everything it can, even if they won't be used) and Oozie's (I believe; make the 
user explicitly choose at submission time which tokens the app will need).

bq. So the reason why I was proposing different configs is for clarity of the 
end user.

Why would the user care how the Spark application is started by a service? And 
in the second case you would not need the kinit for the service account.

bq. I think the other point is that giving to the external systems the 
responsibility of pushing tokens can cause an indefinite number of issues 

It also solves two issues with the other approach: not everybody has a keytab, 
and those who do generally dislike their keytabs being sent around the network 
and stored in a bunch of places.

> Allow specifying a keytab/principal for proxy user for token renewal 
> -
>
> Key: SPARK-25732
> URL: https://issues.apache.org/jira/browse/SPARK-25732
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> As of now, applications submitted with a proxy user fail after 2 weeks due to 
> the lack of token renewal. In order to enable it, we need the keytab/principal 
> of the impersonated user to be specified, so that they are available for 
> token renewal.
> This JIRA proposes to add two parameters, {{--proxy-user-principal}} and 
> {{--proxy-user-keytab}}, the latter also allowing a keytab to be specified on 
> a distributed FS, so that applications can be submitted by servers (e.g. 
> Livy, Zeppelin) without needing all users' principals to be on that machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25643) Performance issues querying wide rows

2018-10-16 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652039#comment-16652039
 ] 

Ruslan Dautkhanov commented on SPARK-25643:
---

[~viirya] we confirm this problem on our production workloads too. Realizing 
wide tables that have columnar backends is super expensive. In the comments of 
SPARK-25164 you can see that *even simple queries fetching 70k rows take 20 
minutes* on a table with 10m records. 

It would be great if Spark had an optimization to first realize only the 
columns required in the `where` clause, and realize the rest of the columns 
after filtering; it seems this would remove the huge performance overhead on 
wide datasets. 

A key piece of [~bersprockets]'s findings is

{quote}According to initial profiling, it appears that most time is spent 
realizing the entire row in the scan, just so the filter can look at a tiny 
subset of columns and almost certainly throw the row away .. The profiling 
shows 74% of time is spent in FileSourceScanExec{quote}
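
For reference, a rough reproduction sketch (assumed setup; the column count, row
count and paths below are arbitrary): build a wide Parquet table, then compare a
select-* query with a selective filter against selecting only the filter column.

{code:java}
import org.apache.spark.sql.functions.{col, lit}

// Generate a wide table: an id column plus numCols constant columns.
val numCols = 2000
val wideCols = col("id") +: (1 to numCols).map(i => lit(i).as(s"c$i"))
spark.range(100000).toDF("id").select(wideCols: _*)
  .write.mode("overwrite").parquet("/tmp/wide_table")

def timed[T](body: => T): T = {
  val start = System.nanoTime(); val result = body
  println(s"took ${(System.nanoTime() - start) / 1e6} ms"); result
}

// Realizes every column of every scanned row before the filter can reject it.
timed { spark.read.parquet("/tmp/wide_table").where("id = 12345").collect() }
// Realizes only the single column the filter needs.
timed { spark.read.parquet("/tmp/wide_table").select("id").where("id = 12345").collect() }
{code}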



> Performance issues querying wide rows
> -
>
> Key: SPARK-25643
> URL: https://issues.apache.org/jira/browse/SPARK-25643
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Major
>
> Querying a small subset of rows from a wide table (e.g., a table with 6000 
> columns) can be quite slow in the following case:
>  * the table has many rows (most of which will be filtered out)
>  * the projection includes every column of a wide table (i.e., select *)
>  * predicate push down is not helping: either matching rows are sprinkled 
> fairly evenly throughout the table, or predicate push down is switched off
> Even if the filter involves only a single column and the returned result 
> includes just a few rows, the query can run much longer compared to an 
> equivalent query against a similar table with fewer columns.
> According to initial profiling, it appears that most time is spent realizing 
> the entire row in the scan, just so the filter can look at a tiny subset of 
> columns and almost certainly throw the row away. The profiling shows 74% of 
> time is spent in FileSourceScanExec, and that time is spent across numerous 
> writeFields_0_xxx method calls.
> If Spark must realize the entire row just to check a tiny subset of columns, 
> this all sounds reasonable. However, I wonder if there is an optimization 
> here where we can avoid realizing the entire row until after the filter has 
> selected the row.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25742) Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container?

2018-10-16 Thread Yinan Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652064#comment-16652064
 ] 

Yinan Li commented on SPARK-25742:
--

The k8s secrets you add through the {{spark.kubernetes.driver.secrets.}} config 
option will also get mounted into the init-container in the driver pod. You can 
use that to pass credentials for pulling dependencies into the driver 
init-container.
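
A hedged sketch of what that can look like (the secret name {{azure-blob-creds}}
and the mount path below are made up for illustration; the secret itself must be
created in the cluster beforehand):

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Mount the pre-created Kubernetes secret into the driver pod (and hence its
  // init-container) at the given path, where the dependency fetch can read it.
  .set("spark.kubernetes.driver.secrets.azure-blob-creds", "/mnt/secrets/azure")
  // Usually also wanted on executors if they pull the same remote dependencies.
  .set("spark.kubernetes.executor.secrets.azure-blob-creds", "/mnt/secrets/azure")
{code}

The same two properties can equally be passed as {{--conf}} options to spark-submit.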

> Is there a way to pass the Azure blob storage credentials to the spark for 
> k8s init-container?
> --
>
> Key: SPARK-25742
> URL: https://issues.apache.org/jira/browse/SPARK-25742
> Project: Spark
>  Issue Type: Question
>  Components: Kubernetes
>Affects Versions: 2.3.2
>Reporter: Oscar Bonilla
>Priority: Minor
>
> I'm trying to run spark on a kubernetes cluster in Azure. The idea is to 
> store the Spark application jars and dependencies in a container in Azure 
> Blob Storage.
> I've tried to do this with a public container and this works OK, but with a 
> private Blob Storage container, the spark-init init-container 
> doesn't download the jars.
> The equivalent in AWS S3 is as simple as adding the key_id and secret as 
> environment variables, but I don't see how to do this for Azure Blob Storage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25748) Upgrade net.razorvine:pyrolite version

2018-10-16 Thread Ankur Gupta (JIRA)
Ankur Gupta created SPARK-25748:
---

 Summary: Upgrade net.razorvine:pyrolite version
 Key: SPARK-25748
 URL: https://issues.apache.org/jira/browse/SPARK-25748
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Ankur Gupta


The following high-severity security vulnerability was discovered: 
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-1100.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-17875:
---

> Remove unneeded direct dependence on Netty 3.x
> --
>
> Key: SPARK-17875
> URL: https://issues.apache.org/jira/browse/SPARK-17875
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Priority: Trivial
>
> The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is 
> used. It's best to remove the 3.x dependency (and while we're at it, update a 
> few things like license info)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25748) Upgrade net.razorvine:pyrolite version

2018-10-16 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652109#comment-16652109
 ] 

Marcelo Vanzin commented on SPARK-25748:


That CVE seems completely unrelated to pyrolite? Or am I missing something here?

> Upgrade net.razorvine:pyrolite version
> --
>
> Key: SPARK-25748
> URL: https://issues.apache.org/jira/browse/SPARK-25748
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Priority: Major
>
> The following high security vulnerability was discovered: 
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-1100.






[jira] [Created] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-16 Thread Raj (JIRA)
Raj created SPARK-25749:
---

 Summary: Exception thrown while reading avro file with large schema
 Key: SPARK-25749
 URL: https://issues.apache.org/jira/browse/SPARK-25749
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2, 2.3.1, 2.3.0
Reporter: Raj


Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs 
reads an avro source that has a large nested schema. The job fails on Spark 
2.3.1 (it also fails on Spark 2.3.0 and Spark 2.3.2). I am able to replicate 
this with some sample data. Please find below the code, build file & exception 
log.

*Code (EncoderExample.scala)*

 

package com.rj.enc

import com.rj.logger.CustomLogger
import org.apache.log4j.Logger
import com.rj.sc.SparkUtil
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.Encoders

object EncoderExample {

  val log: Logger = CustomLogger.getLogger(this.getClass.getName.dropRight(1))
  val user = "xxx"
  val sourcePath = s"file:///Users/$user/del/avrodata"
  val resultPath = s"file:///Users/$user/del/pqdata"

  def main(args: Array[String]): Unit = {
    writeData() // Create sample data
    readData()  // Read, process & write back the results (the app fails in this method on Spark 2.3.1)
  }

  def readData(): Unit = {
    log.info("sourcePath -> " + sourcePath)
    val ss = SparkUtil.getSparkSession(this.getClass.getName)
    val schema = ScalaReflection.schemaFor[MainCC].dataType.asInstanceOf[StructType]
    import com.databricks.spark.avro._
    import ss.implicits._
    val ds = ss.sqlContext.read.schema(schema).option("basePath", sourcePath)
      .avro(this.sourcePath).as[MainCC]
    log.info("Schema -> " + ds.schema.treeString)
    log.info("Count x -> " + ds.count)
    val encr = Encoders.product[ResultCC]
    val res = ds.map { x =>
      val es: Long = x.header.tamp
      ResultCC(es = es)
    }(encr)
    res.write.parquet(this.resultPath)
  }

  def writeData(): Unit = {
    val ss = SparkUtil.getSparkSession(this.getClass.getName)
    import ss.implicits._
    val ds = ss.sparkContext.parallelize(Seq(MainCC(), MainCC())).toDF // .as[MainCC]
    log.info("source count 5 -> " + ds.count)
    import com.databricks.spark.avro._
    ds.write.avro(this.sourcePath)
    log.info("Written")
  }

}

final case class ResultCC(
  es: Long)

*Case Class (Schema of source avro data)*

package com.rj.enc

case class Header(tamp: Long = 12, xy: Option[String] = Some("aaa"))

case class Key(hi: Option[String] = Some("aaa"))

case class L30 (
  l1: Option[Double] = Some(123d)
 ,l2: Option[Double] = Some(123d)
 ,l3: Option[String] = Some("aaa")
 ,l4: Option[String] = Some("aaa")
 ,l5: Option[String] = Some("aaa")
 ,l6: Option[String] = Some("aaa")
 ,l7: Option[String] = Some("aaa")
)

case class C45 (
  r1: Option[String] = Some("aaa")
 ,r2: Option[String] = Some("aaa")
)

case class B45 (
  e1: Option[String] = Some("aaa")
 ,e2: Option[Int] = Some(123)
 ,e3: Option[String] = Some("aaa")
)

case class D45 (`t1`: Option[String] = Some("aaa"))

case class M30 (
  b1: Option[B45] = Some(B45())
 ,b2: Option[C45] = Some(C45())
 ,b3: Option[D45] = Some(D45())
)

case class Y50 (
  g1: Option[String] = Some("aaa")
 ,g2: Option[String] = Some("aaa")
 ,g3: Option[String] = Some("aaa")
 ,g4: Option[String] = Some("aaa")
 ,g5: Option[String] = Some("aaa")
 ,g6: Option[String] = Some("aaa")
)

case class X50 (
  c1: Option[String] = Some("aaa")
 ,c2: Option[String] = Some("aaa")
 ,c3: Option[String] = Some("aaa")
 ,c4: Option[String] = Some("aaa")
)

case class L10 (
  u1: Option[String] = Some("aaa")
 ,u2: Option[String] = Some("aaa")
 ,u3: Option[String] = Some("aaa")
 ,u4: Option[String] = Some("aaa")
 ,u5: Option[Y50] = Some(Y50())
 ,u6: Option[X50] = Some(X50())
 ,u7: Option[String] = Some("a

[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-16 Thread Raj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raj updated SPARK-25749:

Attachment: MainCC.scala
EncoderExample.scala

> Exception thrown while reading avro file with large schema
> --
>
> Key: SPARK-25749
> URL: https://issues.apache.org/jira/browse/SPARK-25749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Raj
>Priority: Blocker
> Attachments: EncoderExample.scala, MainCC.scala
>
>
> Hi, We are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the job 
> reads avro source that has large nested schema. The job fails for Spark 
> 2.3.1(Have tested in Spark 2.3.0 & Spark 2.3.2 and the job fails in this case 
> also). I am able to replicate this with some sample data. Please find below 
> the code, build file & exception log
> *Code (EncoderExample.scala)*
>  
> package com.rj.enc
> import com.rj.logger.CustomLogger
> import org.apache.log4j.Logger
> import com.rj.sc.SparkUtil
> import org.apache.spark.sql.catalyst.ScalaReflection
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.Encoders
> object EncoderExample {
>  
>  val log: Logger = CustomLogger.getLogger(this.getClass.getName.dropRight(1))
>  val user = "xxx"
>  val sourcePath = s"file:///Users/$user/del/avrodata"
>  val resultPath = s"file:///Users/$user/del/pqdata"
>  
>  def main(args: Array[String]): Unit = {
>  writeData() // Create sample data 
>  readData() // Read, Process & write back the results (App fails in this 
> method for spark 2.3.1)
>  }
>  
>  def readData(): Unit = {
>  log.info("sourcePath -> " + sourcePath)
>  val ss = SparkUtil.getSparkSession(this.getClass.getName)
>  val schema = 
> ScalaReflection.schemaFor[MainCC].dataType.asInstanceOf[StructType]
>  import com.databricks.spark.avro._
>  import ss.implicits._
>  val ds = ss.sqlContext.read.schema(schema).option("basePath", sourcePath).
>  avro(this.sourcePath).as[MainCC]
>  log.info("Schema -> " + ds.schema.treeString)
>  log.info("Count x -> " + ds.count)
>  val encr = Encoders.product[ResultCC]
>  val res = ds.map{ x =>
>  val es: Long = x.header.tamp
>  ResultCC(es = es)
>  }(encr)
>  res.write.parquet(this.resultPath)
>  }
>  
>  def writeData(): Unit = {
>  val ss = SparkUtil.getSparkSession(this.getClass.getName)
>  import ss.implicits._
>  val ds = ss.sparkContext.parallelize(Seq(MainCC(), 
> MainCC())).toDF//.as[MainCC]
>  log.info("source count 5 -> " + ds.count)
>  import com.databricks.spark.avro._
>  ds.write.avro(this.sourcePath)
>  log.info("Written")
>  }
>  
> }
> final case class ResultCC(
>  es: Long)
> *Case Class (Schema of source avro data)*
> package com.rj.enc
>  
> case class Header(tamp: Long = 12, xy: Option[String] = 
> Some("aaa"))
>  
> case class Key(hi: Option[String] = 
> Some("aaa"))
>  
> case class L30 (
>   l1: Option[Double] = Some(123d)
> ,l2: Option[Double] = Some(123d)
>  ,l3: Option[String] = Some("aaa")
>  ,l4: Option[String] = Some("aaa")
>  ,l5: Option[String] = Some("aaa")
>  ,l6: Option[String] = Some("aaa")
>  ,l7: Option[String] = Some("aaa")
> )
>  
> case class C45 (
>  r1: Option[String] = Some("aaa")
> ,r2: Option[String] = Some("aaa")
> )
>  
> case class B45 (
>   e1: Option[String] = Some("aaa")
> ,e2: Option[Int] = Some(123)
>  ,e3: Option[String] = Some("aaa")
> )
>  
> case class D45 (`t1`: Option[String] = 
> Some("aaa"))
>  
> case class M30 (
> b1: Option[B45] = Some(B45())
> ,b2: Option[C45] = Some(C45())
> ,b3: Option[D45] = Some(D45())
> )
>  
> case class Y50 (
>    g1: Option[String] = Some("aaa")
>   ,g2: Option[String] = Some("aaa")
>   ,g3: Option[String] = Some("aaa")
>   ,g4: Option[String] = Some("aaa")
>   ,g5: Option[String] = Some("aaa")
>   ,g6: Option[String] = Some("aaa")
> )
>  
> case class X50 (
>   c1: Option[String] = Some("aaa")
>  ,c2: Option[String] = Some("aaa")
>  ,c3: Option[String] = Some("aaa")

[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-16 Thread Raj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raj updated SPARK-25749:

Attachment: build.sbt

> Exception thrown while reading avro file with large schema
> --
>
> Key: SPARK-25749
> URL: https://issues.apache.org/jira/browse/SPARK-25749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Raj
>Priority: Blocker
> Attachments: EncoderExample.scala, MainCC.scala, build.sbt
>
>
> Hi, We are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the job 
> reads avro source that has large nested schema. The job fails for Spark 
> 2.3.1(Have tested in Spark 2.3.0 & Spark 2.3.2 and the job fails in this case 
> also). I am able to replicate this with some sample data. Please find below 
> the code, build file & exception log
> *Code (EncoderExample.scala)*
>  
> package com.rj.enc
> import com.rj.logger.CustomLogger
> import org.apache.log4j.Logger
> import com.rj.sc.SparkUtil
> import org.apache.spark.sql.catalyst.ScalaReflection
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.Encoders
> object EncoderExample {
>  
>  val log: Logger = CustomLogger.getLogger(this.getClass.getName.dropRight(1))
>  val user = "xxx"
>  val sourcePath = s"file:///Users/$user/del/avrodata"
>  val resultPath = s"file:///Users/$user/del/pqdata"
>  
>  def main(args: Array[String]): Unit = {
>  writeData() // Create sample data 
>  readData() // Read, Process & write back the results (App fails in this 
> method for spark 2.3.1)
>  }
>  
>  def readData(): Unit = {
>  log.info("sourcePath -> " + sourcePath)
>  val ss = SparkUtil.getSparkSession(this.getClass.getName)
>  val schema = 
> ScalaReflection.schemaFor[MainCC].dataType.asInstanceOf[StructType]
>  import com.databricks.spark.avro._
>  import ss.implicits._
>  val ds = ss.sqlContext.read.schema(schema).option("basePath", sourcePath).
>  avro(this.sourcePath).as[MainCC]
>  log.info("Schema -> " + ds.schema.treeString)
>  log.info("Count x -> " + ds.count)
>  val encr = Encoders.product[ResultCC]
>  val res = ds.map{ x =>
>  val es: Long = x.header.tamp
>  ResultCC(es = es)
>  }(encr)
>  res.write.parquet(this.resultPath)
>  }
>  
>  def writeData(): Unit = {
>  val ss = SparkUtil.getSparkSession(this.getClass.getName)
>  import ss.implicits._
>  val ds = ss.sparkContext.parallelize(Seq(MainCC(), 
> MainCC())).toDF//.as[MainCC]
>  log.info("source count 5 -> " + ds.count)
>  import com.databricks.spark.avro._
>  ds.write.avro(this.sourcePath)
>  log.info("Written")
>  }
>  
> }
> final case class ResultCC(
>  es: Long)
> *Case Class (Schema of source avro data)*
> package com.rj.enc
>  
> case class Header(tamp: Long = 12, xy: Option[String] = 
> Some("aaa"))
>  
> case class Key(hi: Option[String] = 
> Some("aaa"))
>  
> case class L30 (
>   l1: Option[Double] = Some(123d)
> ,l2: Option[Double] = Some(123d)
>  ,l3: Option[String] = Some("aaa")
>  ,l4: Option[String] = Some("aaa")
>  ,l5: Option[String] = Some("aaa")
>  ,l6: Option[String] = Some("aaa")
>  ,l7: Option[String] = Some("aaa")
> )
>  
> case class C45 (
>  r1: Option[String] = Some("aaa")
> ,r2: Option[String] = Some("aaa")
> )
>  
> case class B45 (
>   e1: Option[String] = Some("aaa")
> ,e2: Option[Int] = Some(123)
>  ,e3: Option[String] = Some("aaa")
> )
>  
> case class D45 (`t1`: Option[String] = 
> Some("aaa"))
>  
> case class M30 (
> b1: Option[B45] = Some(B45())
> ,b2: Option[C45] = Some(C45())
> ,b3: Option[D45] = Some(D45())
> )
>  
> case class Y50 (
>    g1: Option[String] = Some("aaa")
>   ,g2: Option[String] = Some("aaa")
>   ,g3: Option[String] = Some("aaa")
>   ,g4: Option[String] = Some("aaa")
>   ,g5: Option[String] = Some("aaa")
>   ,g6: Option[String] = Some("aaa")
> )
>  
> case class X50 (
>   c1: Option[String] = Some("aaa")
>  ,c2: Option[String] = Some("aaa")
>  ,c3: Option[String] = Some("aaa")
>  ,c4: Option[String] = Some

[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-16 Thread Raj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raj updated SPARK-25749:

Attachment: exception

> Exception thrown while reading avro file with large schema
> --
>
> Key: SPARK-25749
> URL: https://issues.apache.org/jira/browse/SPARK-25749
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Raj
>Priority: Blocker
> Attachments: EncoderExample.scala, MainCC.scala, build.sbt, exception
>
>
> Hi, We are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the job 
> reads avro source that has large nested schema. The job fails for Spark 
> 2.3.1(Have tested in Spark 2.3.0 & Spark 2.3.2 and the job fails in this case 
> also). I am able to replicate this with some sample data. Please find below 
> the code, build file & exception log
> *Code (EncoderExample.scala)*
>  
> package com.rj.enc
> import com.rj.logger.CustomLogger
> import org.apache.log4j.Logger
> import com.rj.sc.SparkUtil
> import org.apache.spark.sql.catalyst.ScalaReflection
> import org.apache.spark.sql.types.StructType
> import org.apache.spark.sql.Encoders
> object EncoderExample {
>  
>  val log: Logger = CustomLogger.getLogger(this.getClass.getName.dropRight(1))
>  val user = "xxx"
>  val sourcePath = s"file:///Users/$user/del/avrodata"
>  val resultPath = s"file:///Users/$user/del/pqdata"
>  
>  def main(args: Array[String]): Unit = {
>  writeData() // Create sample data 
>  readData() // Read, Process & write back the results (App fails in this 
> method for spark 2.3.1)
>  }
>  
>  def readData(): Unit = {
>  log.info("sourcePath -> " + sourcePath)
>  val ss = SparkUtil.getSparkSession(this.getClass.getName)
>  val schema = 
> ScalaReflection.schemaFor[MainCC].dataType.asInstanceOf[StructType]
>  import com.databricks.spark.avro._
>  import ss.implicits._
>  val ds = ss.sqlContext.read.schema(schema).option("basePath", sourcePath).
>  avro(this.sourcePath).as[MainCC]
>  log.info("Schema -> " + ds.schema.treeString)
>  log.info("Count x -> " + ds.count)
>  val encr = Encoders.product[ResultCC]
>  val res = ds.map{ x =>
>  val es: Long = x.header.tamp
>  ResultCC(es = es)
>  }(encr)
>  res.write.parquet(this.resultPath)
>  }
>  
>  def writeData(): Unit = {
>  val ss = SparkUtil.getSparkSession(this.getClass.getName)
>  import ss.implicits._
>  val ds = ss.sparkContext.parallelize(Seq(MainCC(), 
> MainCC())).toDF//.as[MainCC]
>  log.info("source count 5 -> " + ds.count)
>  import com.databricks.spark.avro._
>  ds.write.avro(this.sourcePath)
>  log.info("Written")
>  }
>  
> }
> final case class ResultCC(
>  es: Long)
> *Case Class (Schema of source avro data)*
> package com.rj.enc
>  
> case class Header(tamp: Long = 12, xy: Option[String] = 
> Some("aaa"))
>  
> case class Key(hi: Option[String] = 
> Some("aaa"))
>  
> case class L30 (
>   l1: Option[Double] = Some(123d)
> ,l2: Option[Double] = Some(123d)
>  ,l3: Option[String] = Some("aaa")
>  ,l4: Option[String] = Some("aaa")
>  ,l5: Option[String] = Some("aaa")
>  ,l6: Option[String] = Some("aaa")
>  ,l7: Option[String] = Some("aaa")
> )
>  
> case class C45 (
>  r1: Option[String] = Some("aaa")
> ,r2: Option[String] = Some("aaa")
> )
>  
> case class B45 (
>   e1: Option[String] = Some("aaa")
> ,e2: Option[Int] = Some(123)
>  ,e3: Option[String] = Some("aaa")
> )
>  
> case class D45 (`t1`: Option[String] = 
> Some("aaa"))
>  
> case class M30 (
> b1: Option[B45] = Some(B45())
> ,b2: Option[C45] = Some(C45())
> ,b3: Option[D45] = Some(D45())
> )
>  
> case class Y50 (
>    g1: Option[String] = Some("aaa")
>   ,g2: Option[String] = Some("aaa")
>   ,g3: Option[String] = Some("aaa")
>   ,g4: Option[String] = Some("aaa")
>   ,g5: Option[String] = Some("aaa")
>   ,g6: Option[String] = Some("aaa")
> )
>  
> case class X50 (
>   c1: Option[String] = Some("aaa")
>  ,c2: Option[String] = Some("aaa")
>  ,c3: Option[String] = Some("aaa")
>  ,c4: Option[Str

[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema

2018-10-16 Thread Raj (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raj updated SPARK-25749:

Description: 
Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs 
reads an avro source that has a large nested schema. The job fails on Spark 
2.3.1 (it also fails on Spark 2.3.0 and Spark 2.3.2). I am able to replicate 
this with some sample data plus a dummy case class. Please find attached the 
following:

*Code*: EncoderExample.scala, MainCC.scala & build.sbt

*Exception log*: exception

PS:

I am getting the exception {{java.lang.OutOfMemoryError: Java heap space}}. I 
have tried increasing the JVM heap size in Eclipse, but that does not help 
either.

I have also tested the code on Spark 2.2.2 and it works fine. It seems this bug 
was introduced in Spark 2.3.0.

 

  was:
Hi, We are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the job 
reads avro source that has large nested schema. The job fails for Spark 
2.3.1(Have tested in Spark 2.3.0 & Spark 2.3.2 and the job fails in this case 
also). I am able to replicate this with some sample data. Please find below the 
code, build file & exception log

*Code (EncoderExample.scala)*

 

package com.rj.enc

import com.rj.logger.CustomLogger
import org.apache.log4j.Logger
import com.rj.sc.SparkUtil
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.Encoders


object EncoderExample {
 
 val log: Logger = CustomLogger.getLogger(this.getClass.getName.dropRight(1))
 val user = "xxx"
 val sourcePath = s"file:///Users/$user/del/avrodata"
 val resultPath = s"file:///Users/$user/del/pqdata"
 
 def main(args: Array[String]): Unit = {
 writeData() // Create sample data 
 readData() // Read, Process & write back the results (App fails in this method 
for spark 2.3.1)
 }
 
 def readData(): Unit = {
 log.info("sourcePath -> " + sourcePath)
 val ss = SparkUtil.getSparkSession(this.getClass.getName)
 val schema = 
ScalaReflection.schemaFor[MainCC].dataType.asInstanceOf[StructType]
 import com.databricks.spark.avro._
 import ss.implicits._
 val ds = ss.sqlContext.read.schema(schema).option("basePath", sourcePath).
 avro(this.sourcePath).as[MainCC]
 log.info("Schema -> " + ds.schema.treeString)
 log.info("Count x -> " + ds.count)
 val encr = Encoders.product[ResultCC]
 val res = ds.map{ x =>
 val es: Long = x.header.tamp
 ResultCC(es = es)
 }(encr)
 res.write.parquet(this.resultPath)
 }
 
 def writeData(): Unit = {
 val ss = SparkUtil.getSparkSession(this.getClass.getName)
 import ss.implicits._
 val ds = ss.sparkContext.parallelize(Seq(MainCC(), MainCC())).toDF//.as[MainCC]
 log.info("source count 5 -> " + ds.count)
 import com.databricks.spark.avro._
 ds.write.avro(this.sourcePath)
 log.info("Written")
 }
 
}

final case class ResultCC(
 es: Long)

*Case Class (Schema of source avro data)*

package com.rj.enc

 

case class Header(tamp: Long = 12, xy: Option[String] = 
Some("aaa"))

 

case class Key(hi: Option[String] = 
Some("aaa"))

 

case class L30 (

  l1: Option[Double] = Some(123d)

,l2: Option[Double] = Some(123d)

 ,l3: Option[String] = Some("aaa")

 ,l4: Option[String] = Some("aaa")

 ,l5: Option[String] = Some("aaa")

 ,l6: Option[String] = Some("aaa")

 ,l7: Option[String] = Some("aaa")

)

 

case class C45 (

 r1: Option[String] = Some("aaa")

,r2: Option[String] = Some("aaa")

)

 

case class B45 (

  e1: Option[String] = Some("aaa")

,e2: Option[Int] = Some(123)

 ,e3: Option[String] = Some("aaa")

)

 

case class D45 (`t1`: Option[String] = 
Some("aaa"))

 

case class M30 (

b1: Option[B45] = Some(B45())

,b2: Option[C45] = Some(C45())

,b3: Option[D45] = Some(D45())

)

 

case class Y50 (

   g1: Option[String] = Some("aaa")

  ,g2: Option[String] = Some("aaa")

  ,g3: Option[String] = Some("aaa")

  ,g4: Option[String] = Some("aaa")

  ,g5: Option[String] = Some("aaa")

  ,g6: Option[String] = Some("aaa")

)

 

case class X50 (

  c1: Option[String] = Some("aaa")

 ,c2: Option[String] = Some("aaa")

 ,c3: Option[String] = Some("aaa")

[jira] [Resolved] (SPARK-25748) Upgrade net.razorvine:pyrolite version

2018-10-16 Thread Ankur Gupta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Gupta resolved SPARK-25748.
-
Resolution: Invalid

Yeah, it seems this jar was tagged incorrectly in the internal security scan. 
Closing the jira.

> Upgrade net.razorvine:pyrolite version
> --
>
> Key: SPARK-25748
> URL: https://issues.apache.org/jira/browse/SPARK-25748
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Ankur Gupta
>Priority: Major
>
> The following high security vulnerability was discovered: 
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-1100.






[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652393#comment-16652393
 ] 

Apache Spark commented on SPARK-20327:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22751

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: newbie
> Fix For: 3.0.0
>
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.






[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652395#comment-16652395
 ] 

Apache Spark commented on SPARK-20327:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/22751

> Add CLI support for YARN custom resources, like GPUs
> 
>
> Key: SPARK-20327
> URL: https://issues.apache.org/jira/browse/SPARK-20327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Daniel Templeton
>Assignee: Szilard Nemeth
>Priority: Major
>  Labels: newbie
> Fix For: 3.0.0
>
>
> YARN-3926 adds the ability for administrators to configure custom resources, 
> like GPUs.  This JIRA is to add support to Spark for requesting resources 
> other than CPU virtual cores and memory.  See YARN-3926.






[jira] [Created] (SPARK-25750) Secure HDFS Integration Testing

2018-10-16 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25750:
--

 Summary: Secure HDFS Integration Testing
 Key: SPARK-25750
 URL: https://issues.apache.org/jira/browse/SPARK-25750
 Project: Spark
  Issue Type: Test
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Ilan Filonenko


Integration testing for Secure HDFS interaction for Spark on Kubernetes. 






[jira] [Assigned] (SPARK-25750) Secure HDFS Integration Testing

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25750:


Assignee: Apache Spark

> Secure HDFS Integration Testing
> ---
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Assignee: Apache Spark
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 






[jira] [Commented] (SPARK-25750) Secure HDFS Integration Testing

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652400#comment-16652400
 ] 

Apache Spark commented on SPARK-25750:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22608

> Secure HDFS Integration Testing
> ---
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 






[jira] [Assigned] (SPARK-25750) Secure HDFS Integration Testing

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25750:


Assignee: (was: Apache Spark)

> Secure HDFS Integration Testing
> ---
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 






[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive

2018-10-16 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652440#comment-16652440
 ] 

t oo commented on SPARK-20202:
--

bump

> Remove references to org.spark-project.hive
> ---
>
> Key: SPARK-20202
> URL: https://issues.apache.org/jira/browse/SPARK-20202
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.6.4, 2.0.3, 2.1.1
>Reporter: Owen O'Malley
>Priority: Major
>
> Spark can't continue to depend on their fork of Hive and must move to 
> standard Hive versions.






[jira] [Created] (SPARK-25751) Unit Testing for Kerberos Support for Spark on Kubernetes

2018-10-16 Thread Ilan Filonenko (JIRA)
Ilan Filonenko created SPARK-25751:
--

 Summary: Unit Testing for Kerberos Support for Spark on Kubernetes
 Key: SPARK-25751
 URL: https://issues.apache.org/jira/browse/SPARK-25751
 Project: Spark
  Issue Type: Test
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Ilan Filonenko


Unit tests for Kerberos Support within Spark on Kubernetes.






[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-10-16 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652449#comment-16652449
 ] 

t oo commented on SPARK-18673:
--

bump

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.






[jira] [Updated] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes

2018-10-16 Thread Ilan Filonenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ilan Filonenko updated SPARK-25750:
---
Summary: Integration Testing for Kerberos Support for Spark on Kubernetes  
(was: Secure HDFS Integration Testing)

> Integration Testing for Kerberos Support for Spark on Kubernetes
> 
>
> Key: SPARK-25750
> URL: https://issues.apache.org/jira/browse/SPARK-25750
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Ilan Filonenko
>Priority: Major
>
> Integration testing for Secure HDFS interaction for Spark on Kubernetes. 






[jira] [Created] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases

2018-10-16 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-25752:
-

 Summary: Add trait to easily whitelist logical operators that 
produce named output from CleanupAliases
 Key: SPARK-25752
 URL: https://issues.apache.org/jira/browse/SPARK-25752
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Tathagata Das
Assignee: Tathagata Das


The rule `CleanupAliases` cleans up aliases from logical operators that do not 
match a whitelist. This whitelist is hardcoded inside the rule, which is 
cumbersome. This PR cleans that up by introducing a trait `HasNamedOutput` that 
is ignored by `CleanupAliases`; operators that require aliases to be preserved 
in their output should extend it.
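
A rough Scala sketch of the shape this describes (the example operator and the 
commented rule wiring are illustrative assumptions, not the actual patch):

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Marker trait for operators that produce named output and must keep their aliases.
trait HasNamedOutput { self: LogicalPlan => }

// A hypothetical operator opting out of alias cleanup by mixing in the marker trait.
case class SomeNamedOutputOp(child: LogicalPlan) extends UnaryNode with HasNamedOutput {
  override def output: Seq[Attribute] = child.output
}

// Inside CleanupAliases the hardcoded whitelist would then collapse to a type test,
// roughly:
//   plan resolveOperators {
//     case p: HasNamedOutput => p                       // aliases left untouched
//     case other => other.transformExpressions { ... }  // trim non-top-level aliases
//   }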






[jira] [Assigned] (SPARK-25394) Expose App status metrics as Source

2018-10-16 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-25394:
--

Assignee: Stavros Kontopoulos

> Expose App status metrics as Source
> ---
>
> Key: SPARK-25394
> URL: https://issues.apache.org/jira/browse/SPARK-25394
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> ApplicationListener in Spark core captures useful metrics like job duration 
> which are exposed only via the spark rest api (available both at the driver's 
> ui and the history server ui). From a devops perspective, and especially on 
> k8s where metrics scraping is often done via the prometheus jmx exporter, it 
> would be good to have all metrics in one place, avoiding scraping the metrics 
> rest api.
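
For context, "expose as a Source" means plugging into Spark's Dropwizard-based 
metrics system so the values reach whatever sinks are configured (JMX, Graphite, 
and hence the Prometheus JMX exporter). A loose sketch of that shape, with an 
illustrative class name and a placeholder gauge rather than the merged patch:

package org.apache.spark.status

import com.codahale.metrics.{Gauge, MetricRegistry}

import org.apache.spark.metrics.source.Source

// Once registered with the driver's MetricsSystem, this shows up in every configured sink.
private[spark] class AppStatusSource extends Source {
  override val sourceName: String = "appStatus"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  // Placeholder gauge; the real change would feed values from the app status listener.
  metricRegistry.register(MetricRegistry.name("jobDuration"), new Gauge[Long] {
    override def getValue: Long = 0L
  })
}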






[jira] [Resolved] (SPARK-25394) Expose App status metrics as Source

2018-10-16 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25394.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22381
[https://github.com/apache/spark/pull/22381]

> Expose App status metrics as Source
> ---
>
> Key: SPARK-25394
> URL: https://issues.apache.org/jira/browse/SPARK-25394
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Stavros Kontopoulos
>Assignee: Stavros Kontopoulos
>Priority: Major
> Fix For: 3.0.0
>
>
> ApplicationListener in Spark core captures useful metrics like job duration 
> which are exposed only via the spark rest api (available both at the driver's 
> ui and the history server ui). From a devops perspective, and especially on 
> k8s where metrics scraping is often done via the prometheus jmx exporter, it 
> would be good to have all metrics in one place, avoiding scraping the metrics 
> rest api.






[jira] [Resolved] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25631.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22670
[https://github.com/apache/spark/pull/22670]

> KafkaRDDSuite: basic usage 2 min 4 sec
> ---
>
> Key: SPARK-25631
> URL: https://issues.apache.org/jira/browse/SPARK-25631
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage
> Took 2 min 4 sec.






[jira] [Assigned] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25631:
-

Assignee: Dilip Biswal

> KafkaRDDSuite: basic usage 2 min 4 sec
> ---
>
> Key: SPARK-25631
> URL: https://issues.apache.org/jira/browse/SPARK-25631
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage
> Took 2 min 4 sec.






[jira] [Resolved] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25632.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22670
[https://github.com/apache/spark/pull/22670]

> KafkaRDDSuite: compacted topic 2 min 5 sec.
> ---
>
> Key: SPARK-25632
> URL: https://issues.apache.org/jira/browse/SPARK-25632
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic
> Took 2 min 5 sec.






[jira] [Assigned] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.

2018-10-16 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-25632:
-

Assignee: Dilip Biswal

> KafkaRDDSuite: compacted topic 2 min 5 sec.
> ---
>
> Key: SPARK-25632
> URL: https://issues.apache.org/jira/browse/SPARK-25632
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 3.0.0
>
>
> org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic
> Took 2 min 5 sec.






[jira] [Resolved] (SPARK-25693) Fix the multiple "manager" classes in org.apache.spark.deploy.security

2018-10-16 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-25693.

Resolution: Duplicate

I started looking at this while waiting for reviews of SPARK-23781, and it will 
be less noisy if I do both together. It ends up being not that much larger than 
SPARK-23781 in isolation.

> Fix the multiple "manager" classes in org.apache.spark.deploy.security
> --
>
> Key: SPARK-25693
> URL: https://issues.apache.org/jira/browse/SPARK-25693
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> In SPARK-23781 I've done some refactoring which introduces an 
> {{AbstractCredentialManager}} class. That name sort of clashes with the 
> existing {{HadoopDelegationTokenManager}}.
> Since the latter doesn't really manage anything (it just orchestrates the 
> fetching of the tokens), we could rename it to something more generic, or even 
> clean up some of that class hierarchy.






[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-16 Thread Devaraj K (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652687#comment-16652687
 ] 

Devaraj K commented on SPARK-24787:
---

It seems the overhead here comes from the FileChannel.force call in the 
DataNode, which is the part of hsync that writes the data to the storage 
device. The hsync cost is not much different with or without the 
SyncFlag.UPDATE_LENGTH flag, probably because updating the length is a simple 
call to the NameNode.

I think the hsync change can be reverted, and the history server can instead 
get the latest file length using DFSInputStream.getFileLength(), which includes 
lastBlockBeingWrittenLength: if the cached length is the same as 
FileStatus.getLen(), the history server can make an additional call to get the 
latest length via DFSInputStream.getFileLength() and decide whether to update 
the history log or not.
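
A hedged sketch of that check on the history-server side; the helper is made 
up, and it goes through HdfsDataInputStream.getVisibleLength(), which delegates 
to DFSInputStream.getFileLength() and so includes the length of the last block 
still being written:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.client.HdfsDataInputStream

// Best-known length of an in-progress event log, without relying on frequent hsync.
def inProgressLength(fs: FileSystem, logPath: Path, cachedLen: Long): Long = {
  val statusLen = fs.getFileStatus(logPath).getLen
  if (statusLen != cachedLen) {
    // FileStatus already reflects new data; no extra call needed.
    statusLen
  } else {
    // FileStatus looks unchanged, so ask the DFS client for the visible length,
    // which accounts for the last, still-being-written block.
    val in = fs.open(logPath)
    try {
      in match {
        case hdfsIn: HdfsDataInputStream => hdfsIn.getVisibleLength
        case _ => statusLen
      }
    } finally {
      in.close()
    }
  }
}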

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Assigned] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24787:


Assignee: Apache Spark

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Assignee: Apache Spark
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Assigned] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-16 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24787:


Assignee: (was: Apache Spark)

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652691#comment-16652691
 ] 

Apache Spark commented on SPARK-24787:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22752

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-10-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652690#comment-16652690
 ] 

Apache Spark commented on SPARK-24787:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/22752

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Commented] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work

2018-10-16 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652746#comment-16652746
 ] 

Bryan Cutler commented on SPARK-25733:
--

Is this a duplicate of SPARK-23961?

> The method toLocalIterator() with dataframe doesn't work
> 
>
> Key: SPARK-25733
> URL: https://issues.apache.org/jira/browse/SPARK-25733
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Spark in standalone mode, and 48 cores are available.
> spark-defaults.conf as blew:
> spark.pyshark.python /usr/bin/python3.6
> spark.driver.memory 4g
> spark.executor.memory 8g
>  
> other configurations are at default.
>Reporter: Bihui Jin
>Priority: Major
> Attachments: report_dataset.zip.001, report_dataset.zip.002
>
>
> {color:#FF}The dataset which I used attached.{color}
>  
> First I loaded a dataframe from local disk:
> df = spark.read.load('report_dataset')
> there are about 200 partitions stored in s3, and the max size of partitions 
> is 28.37MB.
>  
> after data loaded,  I execute "df.take(1)" to test the dataframe, and 
> expected output printed 
> "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html',
>  sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, 
> 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, 
> 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, 
> 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 
> 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, 
> 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 
> 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], 
> next_word=575, line_num=12)]" 
>  
> Then I try to convert dataframe to the local iterator and want to print one 
> row in dataframe for testing, and blew code is used:
> for row in df.toLocalIterator():
>     print(row)
>     break
> {color:#ff}*But there is no output printed after that code 
> executed.*{color}
>  
> Then I execute "df.take(1)" and blew error is reported:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1159, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 985, in send_command
> response = connection.send_command(command)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1164, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java 
> server (127.0.0.1:37735)
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 2963, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "", line 1, in 
> df.take(1)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 504, in take
> return self.limit(num).collect()
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line 
> 493, in limit
> jdf = self._jdf.limit(num)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line 
> 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
> File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, 
> in deco
> return f(*a, **kw)
> File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in 
> get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o29.limit
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File 
> "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", 
> line 1863, in showtraceback
> stb = v

[jira] [Created] (SPARK-25753) binaryFiles broken for small files

2018-10-16 Thread liuxian (JIRA)
liuxian created SPARK-25753:
---

 Summary: binaryFiles broken for small files
 Key: SPARK-25753
 URL: https://issues.apache.org/jira/browse/SPARK-25753
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 3.0.0
Reporter: liuxian


{{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} (see 
https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for 
small files, the maxSplitSize computed by {{StreamFileInputFormat}} is way 
smaller than the default or commonly used split size of 64/128M, and Spark 
throws an exception while trying to read them.

Exception info:
{{Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
  at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}
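
A rough reproduction sketch (the input path and the per-node minimum are made-up 
values; the point is only that a configured per-node minimum split size larger 
than the tiny maxSplitSize derived from small files trips the 
CombineFileInputFormat check quoted above):

import org.apache.spark.sql.SparkSession

object BinaryFilesSmallFileRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("binaryFiles small-file repro").getOrCreate()
    val sc = spark.sparkContext

    // CombineFileInputFormat rejects the job when this per-node minimum exceeds
    // the maxSplitSize that StreamFileInputFormat derives from the (small) input.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.minsize.per.node", 5123456L)

    // A directory holding only a few KB-sized files then fails in getSplits().
    val rdd = sc.binaryFiles("/path/to/dir-with-a-few-small-files")
    println(rdd.count())

    spark.stop()
  }
}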





