[jira] [Resolved] (SPARK-25736) add tests to verify the behavior of multi-column count
[ https://issues.apache.org/jira/browse/SPARK-25736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25736. -- Resolution: Fixed Fix Version/s: 2.4.0 3.0.0 Issue resolved by pull request 22728 [https://github.com/apache/spark/pull/22728] > add tests to verify the behavior of multi-column count > -- > > Key: SPARK-25736 > URL: https://issues.apache.org/jira/browse/SPARK-25736 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Minor > Fix For: 3.0.0, 2.4.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
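For context on the behavior these new tests verify: a multi-column count such as count(a, b) counts only the rows in which none of its arguments is NULL, while count(*) counts every row. The sketch below is an illustration of that semantics only, not the test code added in PR 22728.
{code:scala}
// Minimal, self-contained illustration of multi-column count semantics.
import org.apache.spark.sql.SparkSession

object MultiColumnCountDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("multi-column-count").getOrCreate()
    // count(*) counts all rows, count(a) skips rows where a is NULL,
    // and count(a, b) skips rows where a or b is NULL.
    spark.sql(
      "SELECT count(*), count(a), count(a, b) FROM VALUES (1, NULL), (NULL, 2), (3, 4) AS t(a, b)"
    ).show()
    // Expected counts: 3, 2, 1
    spark.stop()
  }
}
{code}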
[jira] [Created] (SPARK-25740) Set some configuration need invalidateStatsCache
Yuming Wang created SPARK-25740: --- Summary: Set some configuration need invalidateStatsCache Key: SPARK-25740 URL: https://issues.apache.org/jira/browse/SPARK-25740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang How to reproduce: {code:sql} # spark-sql create table t1 (a int) stored as parquet; create table t2 (a int) stored as parquet; insert into table t1 values (1); insert into table t2 values (1); explain select * from t1, t2 where t1.a = t2.a; exit; spark-sql explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin, it should be BroadcastHashJoin exit; spark-sql set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- BroadcastHashJoin {code} We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do is invalidateAllCachedTables when execute set Command: {code} val isInvalidateAllCachedTablesKeys = Set( SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, SQLConf.DEFAULT_SIZE_IN_BYTES.key ) sparkSession.conf.set(key, value) if (isInvalidateAllCachedTablesKeys.contains(key)) { sparkSession.sessionState.catalog.invalidateAllCachedTables() } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
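Until such configuration changes invalidate the statistics cache automatically, one possible manual workaround — an untested suggestion, not part of this ticket or its pull request — is to refresh the affected tables after changing the setting so their cached plans and statistics are recomputed under the new value.
{code:scala}
// Untested workaround sketch: explicitly refresh the tables whose cached
// statistics may be stale after changing a statistics-related configuration.
import org.apache.spark.sql.SparkSession

object RefreshAfterConfChange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    spark.sql("SET spark.sql.statistics.fallBackToHdfs=true")
    // Drop the cached metadata/plans for the joined tables before re-planning.
    Seq("t1", "t2").foreach(spark.catalog.refreshTable)
    spark.sql("EXPLAIN SELECT * FROM t1, t2 WHERE t1.a = t2.a").show(false)
  }
}
{code}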
[jira] [Assigned] (SPARK-25740) Set some configuration need invalidateStatsCache
[ https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25740: Assignee: Apache Spark > Set some configuration need invalidateStatsCache > > > Key: SPARK-25740 > URL: https://issues.apache.org/jira/browse/SPARK-25740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {code:sql} > # spark-sql > create table t1 (a int) stored as parquet; > create table t2 (a int) stored as parquet; > insert into table t1 values (1); > insert into table t2 values (1); > explain select * from t1, t2 where t1.a = t2.a; > exit; > spark-sql > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin, it should be BroadcastHashJoin > exit; > spark-sql > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- BroadcastHashJoin > {code} > We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do > is invalidateAllCachedTables when execute set Command: > {code} > val isInvalidateAllCachedTablesKeys = Set( > SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, > SQLConf.DEFAULT_SIZE_IN_BYTES.key > ) > sparkSession.conf.set(key, value) > if (isInvalidateAllCachedTablesKeys.contains(key)) { > sparkSession.sessionState.catalog.invalidateAllCachedTables() > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25740) Set some configuration need invalidateStatsCache
[ https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651264#comment-16651264 ] Apache Spark commented on SPARK-25740: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22743 > Set some configuration need invalidateStatsCache > > > Key: SPARK-25740 > URL: https://issues.apache.org/jira/browse/SPARK-25740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > # spark-sql > create table t1 (a int) stored as parquet; > create table t2 (a int) stored as parquet; > insert into table t1 values (1); > insert into table t2 values (1); > explain select * from t1, t2 where t1.a = t2.a; > exit; > spark-sql > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin, it should be BroadcastHashJoin > exit; > spark-sql > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- BroadcastHashJoin > {code} > We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do > is invalidateAllCachedTables when execute set Command: > {code} > val isInvalidateAllCachedTablesKeys = Set( > SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, > SQLConf.DEFAULT_SIZE_IN_BYTES.key > ) > sparkSession.conf.set(key, value) > if (isInvalidateAllCachedTablesKeys.contains(key)) { > sparkSession.sessionState.catalog.invalidateAllCachedTables() > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25740) Set some configuration need invalidateStatsCache
[ https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25740: Assignee: (was: Apache Spark) > Set some configuration need invalidateStatsCache > > > Key: SPARK-25740 > URL: https://issues.apache.org/jira/browse/SPARK-25740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > # spark-sql > create table t1 (a int) stored as parquet; > create table t2 (a int) stored as parquet; > insert into table t1 values (1); > insert into table t2 values (1); > explain select * from t1, t2 where t1.a = t2.a; > exit; > spark-sql > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin, it should be BroadcastHashJoin > exit; > spark-sql > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- BroadcastHashJoin > {code} > We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do > is invalidateAllCachedTables when execute set Command: > {code} > val isInvalidateAllCachedTablesKeys = Set( > SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, > SQLConf.DEFAULT_SIZE_IN_BYTES.key > ) > sparkSession.conf.set(key, value) > if (isInvalidateAllCachedTablesKeys.contains(key)) { > sparkSession.sessionState.catalog.invalidateAllCachedTables() > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25740) Set some configuration need invalidateStatsCache
[ https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25740: Description: How to reproduce: {code:sql} # spark-sql create table t1 (a int) stored as parquet; create table t2 (a int) stored as parquet; insert into table t1 values (1); insert into table t2 values (1); explain select * from t1, t2 where t1.a = t2.a; exit; spark-sql set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- BroadcastHashJoin exit; # spark-sql explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin, it should be BroadcastHashJoin exit; {code} We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do is invalidateAllCachedTables when execute set Command: {code:java} val isInvalidateAllCachedTablesKeys = Set( SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, SQLConf.DEFAULT_SIZE_IN_BYTES.key ) sparkSession.conf.set(key, value) if (isInvalidateAllCachedTablesKeys.contains(key)) { sparkSession.sessionState.catalog.invalidateAllCachedTables() } {code} was: How to reproduce: {code:sql} # spark-sql create table t1 (a int) stored as parquet; create table t2 (a int) stored as parquet; insert into table t1 values (1); insert into table t2 values (1); explain select * from t1, t2 where t1.a = t2.a; exit; spark-sql explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin, it should be BroadcastHashJoin exit; spark-sql set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- BroadcastHashJoin {code} We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do is invalidateAllCachedTables when execute set Command: {code} val isInvalidateAllCachedTablesKeys = Set( SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, SQLConf.DEFAULT_SIZE_IN_BYTES.key ) sparkSession.conf.set(key, value) if (isInvalidateAllCachedTablesKeys.contains(key)) { sparkSession.sessionState.catalog.invalidateAllCachedTables() } {code} > Set some configuration need invalidateStatsCache > > > Key: SPARK-25740 > URL: https://issues.apache.org/jira/browse/SPARK-25740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > # spark-sql > create table t1 (a int) stored as parquet; > create table t2 (a int) stored as parquet; > insert into table t1 values (1); > insert into table t2 values (1); > explain select * from t1, t2 where t1.a = t2.a; > exit; > spark-sql > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- BroadcastHashJoin > exit; > # spark-sql > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin, it should be BroadcastHashJoin > exit; > {code} > We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do > is invalidateAllCachedTables when execute set Command: > {code:java} > val isInvalidateAllCachedTablesKeys = Set( > SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, > SQLConf.DEFAULT_SIZE_IN_BYTES.key > ) > sparkSession.conf.set(key, value) > if (isInvalidateAllCachedTablesKeys.contains(key)) { > 
sparkSession.sessionState.catalog.invalidateAllCachedTables() > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25740) Set some configuration need invalidateStatsCache
[ https://issues.apache.org/jira/browse/SPARK-25740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25740: Description: How to reproduce: {code:sql} # spark-sql create table t1 (a int) stored as parquet; create table t2 (a int) stored as parquet; insert into table t1 values (1); insert into table t2 values (1); exit; spark-sql set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- BroadcastHashJoin exit; spark-sql explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin, it should be BroadcastHashJoin exit; {code} We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do is invalidateAllCachedTables when execute set Command: {code:java} val isInvalidateAllCachedTablesKeys = Set( SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, SQLConf.DEFAULT_SIZE_IN_BYTES.key ) sparkSession.conf.set(key, value) if (isInvalidateAllCachedTablesKeys.contains(key)) { sparkSession.sessionState.catalog.invalidateAllCachedTables() } {code} was: How to reproduce: {code:sql} # spark-sql create table t1 (a int) stored as parquet; create table t2 (a int) stored as parquet; insert into table t1 values (1); insert into table t2 values (1); explain select * from t1, t2 where t1.a = t2.a; exit; spark-sql set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- BroadcastHashJoin exit; # spark-sql explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin set spark.sql.statistics.fallBackToHdfs=true; explain select * from t1, t2 where t1.a = t2.a; -- SortMergeJoin, it should be BroadcastHashJoin exit; {code} We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do is invalidateAllCachedTables when execute set Command: {code:java} val isInvalidateAllCachedTablesKeys = Set( SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, SQLConf.DEFAULT_SIZE_IN_BYTES.key ) sparkSession.conf.set(key, value) if (isInvalidateAllCachedTablesKeys.contains(key)) { sparkSession.sessionState.catalog.invalidateAllCachedTables() } {code} > Set some configuration need invalidateStatsCache > > > Key: SPARK-25740 > URL: https://issues.apache.org/jira/browse/SPARK-25740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:sql} > # spark-sql > create table t1 (a int) stored as parquet; > create table t2 (a int) stored as parquet; > insert into table t1 values (1); > insert into table t2 values (1); > exit; > spark-sql > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- BroadcastHashJoin > exit; > spark-sql > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin > set spark.sql.statistics.fallBackToHdfs=true; > explain select * from t1, t2 where t1.a = t2.a; > -- SortMergeJoin, it should be BroadcastHashJoin > exit; > {code} > We need {{LogicalPlanStats.invalidateStatsCache}}, but seems only we can do > is invalidateAllCachedTables when execute set Command: > {code:java} > val isInvalidateAllCachedTablesKeys = Set( > SQLConf.ENABLE_FALL_BACK_TO_HDFS_FOR_STATS.key, > SQLConf.DEFAULT_SIZE_IN_BYTES.key > ) > sparkSession.conf.set(key, value) > if (isInvalidateAllCachedTablesKeys.contains(key)) { > sparkSession.sessionState.catalog.invalidateAllCachedTables() > } > {code} -- This message was sent by Atlassian JIRA 
(v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22386) Data Source V2 improvements
[ https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22386: - Target Version/s: 3.0.0 (was: 2.5.0) > Data Source V2 improvements > --- > > Key: SPARK-22386 > URL: https://issues.apache.org/jira/browse/SPARK-22386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651303#comment-16651303 ] Hyukjin Kwon commented on SPARK-24882: -- [~cloud_fan], should we put this JIRA under SPARK-22386? > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateull part of the API, think of better naming > of some interfaces. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651301#comment-16651301 ] Mridul Muralidharan commented on SPARK-25732: - [~vanzin] With long running applications (not necessarily streaming) needing access (read/write) to various data sources (not just hdfs), is there a way to do this even assuming livy rpc was augmented to support it ? For example, livy server would not know which data sources to fetch tokens for (since that will be part of user application jars/config). For the specific usecase [~mgaido] detailed, proxy principal (foo)/keytab would be present and distinct from zeppelin or livy principal/keytab. The 'proxy' part would simply be for livy to submit the application as the proxied user 'foo' - once application comes up, it will behave as though it was submitted by the user 'foo' with specified keytab (from hdfs) - acquire/renew tokens for user 'foo' from its keytab. [~tgraves] I do share your concern; unfortunately for the usecase Marco is targeting, there does not seem to be an alternative; livy server is man in the middle here (w.r.t submitting client). Having said that, if there is an alternative, I would definitely prefer that over sharing keytabs - even if it is over secured hdfs. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25729) It is better to replace `minPartitions` with `defaultParallelism` , when `minPartitions` is less than `defaultParallelism`
[ https://issues.apache.org/jira/browse/SPARK-25729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liuxian updated SPARK-25729: Description: In ‘WholeTextFileRDD’,when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism` , because this can make better use of resources and improve parallelism. was: In ‘WholeTextFileRDD’,when `minPartitions` is less than `defaultParallelism`, it is better to replace `minPartitions` with `defaultParallelism` , because this can make better use of resources and improve concurrency. > It is better to replace `minPartitions` with `defaultParallelism` , when > `minPartitions` is less than `defaultParallelism` > -- > > Key: SPARK-25729 > URL: https://issues.apache.org/jira/browse/SPARK-25729 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: liuxian >Priority: Minor > > In ‘WholeTextFileRDD’,when `minPartitions` is less than `defaultParallelism`, > it is better to replace `minPartitions` with `defaultParallelism` , because > this can make better use of resources and improve parallelism. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
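The proposal boils down to taking the larger of the two values when WholeTextFileRDD computes its split count. The sketch below illustrates the idea only and is not the actual patch.
{code:scala}
// Sketch of the proposed behavior (not the actual patch): prefer the larger of
// the caller-supplied minPartitions and the scheduler's default parallelism.
object WholeTextFilePartitioning {
  def effectiveMinPartitions(minPartitions: Int, defaultParallelism: Int): Int =
    math.max(minPartitions, defaultParallelism)

  def main(args: Array[String]): Unit = {
    // With the usual wholeTextFiles default of minPartitions = 2 and a cluster
    // whose default parallelism is 16, the RDD would now aim for 16 splits,
    // yielding a smaller maxSplitSize and therefore more tasks.
    println(effectiveMinPartitions(2, 16)) // 16
  }
}
{code}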
[jira] [Commented] (SPARK-10816) EventTime based sessionization
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651404#comment-16651404 ] Jungtaek Lim commented on SPARK-10816: -- Update: I've crafted another performance test for testing same query with data pattern. [https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816] I've separated packages for both data pattern just for simplicity. Classnames are same. Data pattern 1: plenty of rows in same session [https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_rows_in_session] Data pattern 2: plenty of sessions [https://github.com/HeartSaVioR/iot-trucking-app-spark-structured-streaming/tree/benchmarking-SPARK-10816/src/main/scala/com/hortonworks/spark/benchmark/streaming/sessionwindow/plenty_of_sessions] While running benchmark with data pattern 2, I've found some performance hits on my patch so made some fixes as well. Most of the fixes were reducing the number of codegen: but there's also a major fix: made pre-merging sessions in local partition being optional. It seriously harms the performance with data pattern 2. The patch still lacks with state sub-optimal. I guess it is now the major bottleneck on my patch, so wrapping my head to find good alternatives. Baidu's list state would be the one of, since I realized \[3] might put more deltas as well as requires more operations. > EventTime based sessionization > -- > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Reynold Xin >Priority: Major > Attachments: SPARK-10816 Support session window natively.pdf, Session > Window Support For Structure Streaming.pdf > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18127) Add hooks and extension points to Spark
[ https://issues.apache.org/jira/browse/SPARK-18127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651427#comment-16651427 ] Ayush Nigam commented on SPARK-18127: - Is there any documentation on how to use Hooks for Spark? > Add hooks and extension points to Spark > --- > > Key: SPARK-18127 > URL: https://issues.apache.org/jira/browse/SPARK-18127 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Srinath >Assignee: Sameer Agarwal >Priority: Major > Fix For: 2.2.0 > > > As a Spark user I want to be able to customize my spark session. I currently > want to be able to do the following things: > # I want to be able to add custom analyzer rules. This allows me to implement > my own logical constructs; an example of this could be a recursive operator. > # I want to be able to add my own analysis checks. This allows me to catch > problems with spark plans early on. An example of this can be some datasource > specific checks. > # I want to be able to add my own optimizations. This allows me to optimize > plans in different ways, for instance when you use a very different cluster > (for example a one-node X1 instance). This supersedes the current > {{spark.experimental}} methods > # I want to be able to add my own planning strategies. This supersedes the > current {{spark.experimental}} methods. This allows me to plan my own > physical plan, an example of this would to plan my own heavily integrated > data source (CarbonData for example). > # I want to be able to use my own customized SQL constructs. An example of > this would supporting my own dialect, or be able to add constructs to the > current SQL language. I should not have to implement a complete parse, and > should be able to delegate to an underlying parser. > # I want to be able to track modifications and calls to the external catalog. > I want this API to be stable. This allows me to do synchronize with other > systems. > This API should modify the SparkSession when the session gets started, and it > should NOT change the session in flight. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
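On the documentation question above: the hooks added by this ticket are exposed through SparkSessionExtensions (Spark 2.2+). The following is a minimal, unofficial sketch of how they are wired up; the SparkSessionExtensions API docs list the full set of injection points (injectResolutionRule, injectCheckRule, injectOptimizerRule, injectPlannerStrategy, injectParser).
{code:scala}
// Minimal sketch of the SparkSessionExtensions API; not official documentation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A do-nothing optimizer rule, used purely to show where custom logic would go.
case class MyNoopRule(spark: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

object ExtensionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .withExtensions { extensions =>
        // Register the custom rule with the session's optimizer.
        extensions.injectOptimizerRule(MyNoopRule)
      }
      .getOrCreate()
    spark.range(10).count()
    spark.stop()
  }
}
{code}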
[jira] [Updated] (SPARK-25741) Long URLs are not rendered properly in web UI
[ https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-25741: --- Summary: Long URLs are not rendered properly in web UI (was: Long URL are not rendered properly in web UI) > Long URLs are not rendered properly in web UI > - > > Key: SPARK-25741 > URL: https://issues.apache.org/jira/browse/SPARK-25741 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > When the URL for description column in the table of job/stage page is long, > WebUI doesn't render it properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25741) Long URL are not rendered properly in web UI
Gengliang Wang created SPARK-25741: -- Summary: Long URL are not rendered properly in web UI Key: SPARK-25741 URL: https://issues.apache.org/jira/browse/SPARK-25741 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.4.0 Reporter: Gengliang Wang When the URL for description column in the table of job/stage page is long, WebUI doesn't render it properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25741) Long URLs are not rendered properly in web UI
[ https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25741: Assignee: Apache Spark > Long URLs are not rendered properly in web UI > - > > Key: SPARK-25741 > URL: https://issues.apache.org/jira/browse/SPARK-25741 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Minor > > When the URL for description column in the table of job/stage page is long, > WebUI doesn't render it properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25741) Long URLs are not rendered properly in web UI
[ https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651438#comment-16651438 ] Apache Spark commented on SPARK-25741: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22744 > Long URLs are not rendered properly in web UI > - > > Key: SPARK-25741 > URL: https://issues.apache.org/jira/browse/SPARK-25741 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > When the URL for description column in the table of job/stage page is long, > WebUI doesn't render it properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25741) Long URLs are not rendered properly in web UI
[ https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651437#comment-16651437 ] Apache Spark commented on SPARK-25741: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/22744 > Long URLs are not rendered properly in web UI > - > > Key: SPARK-25741 > URL: https://issues.apache.org/jira/browse/SPARK-25741 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > When the URL for description column in the table of job/stage page is long, > WebUI doesn't render it properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25741) Long URLs are not rendered properly in web UI
[ https://issues.apache.org/jira/browse/SPARK-25741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25741: Assignee: (was: Apache Spark) > Long URLs are not rendered properly in web UI > - > > Key: SPARK-25741 > URL: https://issues.apache.org/jira/browse/SPARK-25741 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.0 >Reporter: Gengliang Wang >Priority: Minor > > When the URL for description column in the table of job/stage page is long, > WebUI doesn't render it properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651480#comment-16651480 ] Marco Gaido commented on SPARK-25728: - Thanks [~kiszk]. I will check it ASAP, thanks for your work on this. > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry is to start a discussion about adding structure intermediate > representation for generating Java code from a program using DataFrame or > Dataset API, in addition to the current String-based representation. > This addition is based on the discussions in [a > thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196]. > Please feel free to comment on this JIRA entry or [Google > Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing], > too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25742) Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container?
Oscar Bonilla created SPARK-25742: - Summary: Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container? Key: SPARK-25742 URL: https://issues.apache.org/jira/browse/SPARK-25742 Project: Spark Issue Type: Question Components: Kubernetes Affects Versions: 2.3.2 Reporter: Oscar Bonilla I'm trying to run spark on a kubernetes cluster in Azure. The idea is to store the Spark application jars and dependencies in a container in Azure Blob Storage. I've tried to do this with a public container and this works OK, but when having a private Blob Storage container, the spark-init init container doesn't download the jars. The equivalent in AWS S3 is as simple as adding the key_id and secret as environment variables, but I don't see how to do this for Azure Blob Storage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
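One unverified avenue: the hadoop-azure wasb/wasbs connector reads the storage account key from the Hadoop property fs.azure.account.key.<account>.blob.core.windows.net, and Spark copies any spark.hadoop.* configuration entry into the Hadoop configuration it passes to file systems. Whether the Spark 2.3 spark-init container applies that property when downloading jars is an open assumption; the sketch below only shows how the key would be supplied.
{code:scala}
// Unverified sketch: supply the Azure Blob Storage account key via the Hadoop
// configuration. <account> and <container> are placeholders, and reading the
// key from an environment variable is just one way to avoid hard-coding it.
import org.apache.spark.sql.SparkSession

object AzureBlobAccessDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("azure-blob-demo")
      .config(
        "spark.hadoop.fs.azure.account.key.<account>.blob.core.windows.net",
        sys.env("AZURE_STORAGE_KEY"))
      .getOrCreate()

    // Sanity check that the credentials work for ordinary reads; whether the
    // spark-init init container honors the same property for jar downloads is
    // the open question in this ticket.
    spark.read.textFile("wasbs://<container>@<account>.blob.core.windows.net/some/path").show()
    spark.stop()
  }
}
{code}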
[jira] [Created] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors
neenu created SPARK-25743: - Summary: New executors are not launched for kubernetes spark thrift on deleting existing executors Key: SPARK-25743 URL: https://issues.apache.org/jira/browse/SPARK-25743 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.2.0 Environment: Physical lab configurations. 8 baremetal servers, Each 56 Cores, 384GB RAM, RHEL 7.4 Kernel : 3.10.0-862.9.1.el7.x86_64 redhat-release-server.x86_64 7.4-18.el7 Kubernetes info: Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"} Reporter: neenu Launched spark thrift in kubernetes cluster with dynamic allocation enabled. Configurations set : spark.executor.memory=35g spark.executor.cores=8 spark.dynamicAllocation.enabled=true spark.dynamicAllocation.executorIdleTimeout=10 spark.dynamicAllocation.cachedExecutorIdleTimeout=15 spark.driver.memory=10g spark.driver.cores=4 spark.sql.crossJoin.enabled=true spark.sql.starJoinOptimization=true spark.sql.codegen=true spark.rpc.numRetries=5 spark.rpc.retry.wait=5 spark.sql.broadcastTimeout=1200 spark.network.timeout=1800 spark.dynamicAllocation.maxExecutors=15 spark.kubernetes.allocation.batch.size=2 spark.kubernetes.allocation.batch.delay=9 spark.serializer=org.apache.spark.serializer.KryoSerializer spark.kubernetes.node.selector.is_control=false Tried to run TPCDS queries , on a 1TB parquet snappy data . Found that as the execution progress, the tasks are done by a single executor ( executor 53 ) and no new executors are getting spawned, even though there is enough resources to spawn more executors. Tried to manually delete the executor pod 53 and saw that no new executor has been spawned to replace the one which is running. Attcahed the -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors
[ https://issues.apache.org/jira/browse/SPARK-25743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] neenu updated SPARK-25743: -- Attachment: query_0_correct.sql > New executors are not launched for kubernetes spark thrift on deleting > existing executors > -- > > Key: SPARK-25743 > URL: https://issues.apache.org/jira/browse/SPARK-25743 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.2.0 > Environment: Physical lab configurations. > 8 baremetal servers, > Each 56 Cores, 384GB RAM, RHEL 7.4 > Kernel : 3.10.0-862.9.1.el7.x86_64 > redhat-release-server.x86_64 7.4-18.el7 > > > Kubernetes info: > Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", > GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", > BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", > GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", > BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", > Platform:"linux/amd64"} >Reporter: neenu >Priority: Major > Attachments: driver.log, query_0_correct.sql > > > Launched spark thrift in kubernetes cluster with dynamic allocation enabled. > Configurations set : > spark.executor.memory=35g > spark.executor.cores=8 > spark.dynamicAllocation.enabled=true > spark.dynamicAllocation.executorIdleTimeout=10 > spark.dynamicAllocation.cachedExecutorIdleTimeout=15 > spark.driver.memory=10g > spark.driver.cores=4 > spark.sql.crossJoin.enabled=true > spark.sql.starJoinOptimization=true > spark.sql.codegen=true > spark.rpc.numRetries=5 > spark.rpc.retry.wait=5 > spark.sql.broadcastTimeout=1200 > spark.network.timeout=1800 > spark.dynamicAllocation.maxExecutors=15 > spark.kubernetes.allocation.batch.size=2 > spark.kubernetes.allocation.batch.delay=9 > spark.serializer=org.apache.spark.serializer.KryoSerializer > spark.kubernetes.node.selector.is_control=false > Tried to run TPCDS queries , on a 1TB parquet snappy data . > Found that as the execution progress, the tasks are done by a single executor > ( executor 53 ) and no new executors are getting spawned, even though there > is enough resources to spawn more executors. > > Tried to manually delete the executor pod 53 and saw that no new executor has > been spawned to replace the one which is running. > Attcahed the -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors
[ https://issues.apache.org/jira/browse/SPARK-25743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651514#comment-16651514 ] neenu commented on SPARK-25743: --- Also attached the tpcds query list executed [^query_0_correct.sql] > New executors are not launched for kubernetes spark thrift on deleting > existing executors > -- > > Key: SPARK-25743 > URL: https://issues.apache.org/jira/browse/SPARK-25743 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.2.0 > Environment: Physical lab configurations. > 8 baremetal servers, > Each 56 Cores, 384GB RAM, RHEL 7.4 > Kernel : 3.10.0-862.9.1.el7.x86_64 > redhat-release-server.x86_64 7.4-18.el7 > > > Kubernetes info: > Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", > GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", > BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", > GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", > BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", > Platform:"linux/amd64"} >Reporter: neenu >Priority: Major > Attachments: driver.log > > > Launched spark thrift in kubernetes cluster with dynamic allocation enabled. > Configurations set : > spark.executor.memory=35g > spark.executor.cores=8 > spark.dynamicAllocation.enabled=true > spark.dynamicAllocation.executorIdleTimeout=10 > spark.dynamicAllocation.cachedExecutorIdleTimeout=15 > spark.driver.memory=10g > spark.driver.cores=4 > spark.sql.crossJoin.enabled=true > spark.sql.starJoinOptimization=true > spark.sql.codegen=true > spark.rpc.numRetries=5 > spark.rpc.retry.wait=5 > spark.sql.broadcastTimeout=1200 > spark.network.timeout=1800 > spark.dynamicAllocation.maxExecutors=15 > spark.kubernetes.allocation.batch.size=2 > spark.kubernetes.allocation.batch.delay=9 > spark.serializer=org.apache.spark.serializer.KryoSerializer > spark.kubernetes.node.selector.is_control=false > Tried to run TPCDS queries , on a 1TB parquet snappy data . > Found that as the execution progress, the tasks are done by a single executor > ( executor 53 ) and no new executors are getting spawned, even though there > is enough resources to spawn more executors. > > Tried to manually delete the executor pod 53 and saw that no new executor has > been spawned to replace the one which is running. > Attcahed the -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25743) New executors are not launched for kubernetes spark thrift on deleting existing executors
[ https://issues.apache.org/jira/browse/SPARK-25743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] neenu updated SPARK-25743: -- Attachment: driver.log > New executors are not launched for kubernetes spark thrift on deleting > existing executors > -- > > Key: SPARK-25743 > URL: https://issues.apache.org/jira/browse/SPARK-25743 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.2.0 > Environment: Physical lab configurations. > 8 baremetal servers, > Each 56 Cores, 384GB RAM, RHEL 7.4 > Kernel : 3.10.0-862.9.1.el7.x86_64 > redhat-release-server.x86_64 7.4-18.el7 > > > Kubernetes info: > Client Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", > GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", > BuildDate:"2018-04-27T09:22:21Z", GoVersion:"go1.9.3", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info\{Major:"1", Minor:"10", GitVersion:"v1.10.2", > GitCommit:"81753b10df112992bf51bbc2c2f85208aad78335", GitTreeState:"clean", > BuildDate:"2018-04-27T09:10:24Z", GoVersion:"go1.9.3", Compiler:"gc", > Platform:"linux/amd64"} >Reporter: neenu >Priority: Major > Attachments: driver.log, query_0_correct.sql > > > Launched spark thrift in kubernetes cluster with dynamic allocation enabled. > Configurations set : > spark.executor.memory=35g > spark.executor.cores=8 > spark.dynamicAllocation.enabled=true > spark.dynamicAllocation.executorIdleTimeout=10 > spark.dynamicAllocation.cachedExecutorIdleTimeout=15 > spark.driver.memory=10g > spark.driver.cores=4 > spark.sql.crossJoin.enabled=true > spark.sql.starJoinOptimization=true > spark.sql.codegen=true > spark.rpc.numRetries=5 > spark.rpc.retry.wait=5 > spark.sql.broadcastTimeout=1200 > spark.network.timeout=1800 > spark.dynamicAllocation.maxExecutors=15 > spark.kubernetes.allocation.batch.size=2 > spark.kubernetes.allocation.batch.delay=9 > spark.serializer=org.apache.spark.serializer.KryoSerializer > spark.kubernetes.node.selector.is_control=false > Tried to run TPCDS queries , on a 1TB parquet snappy data . > Found that as the execution progress, the tasks are done by a single executor > ( executor 53 ) and no new executors are getting spawned, even though there > is enough resources to spawn more executors. > > Tried to manually delete the executor pod 53 and saw that no new executor has > been spawned to replace the one which is running. > Attcahed the -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25744) Allow kubernetes integration tests to be run against a real cluster.
Prashant Sharma created SPARK-25744: --- Summary: Allow kubernetes integration tests to be run against a real cluster. Key: SPARK-25744 URL: https://issues.apache.org/jira/browse/SPARK-25744 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.0.0 Reporter: Prashant Sharma Currently, tests can only run against a minikube cluster, testing against a real cluster gives more flexibility in writing tests with more number of executors and resources. It will also be helpful, if minikube is unavailable for testing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651546#comment-16651546 ] Apache Spark commented on SPARK-21402: -- User 'vofque' has created a pull request for this issue: https://github.com/apache/spark/pull/22745 > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-21402 > URL: https://issues.apache.org/jira/browse/SPARK-21402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Priority: Major > > I have the following schema in a dataset - > root > |-- userId: string (nullable = true) > |-- data: map (nullable = true) > ||-- key: string > ||-- value: struct (valueContainsNull = true) > |||-- startTime: long (nullable = true) > |||-- endTime: long (nullable = true) > |-- offset: long (nullable = true) > And I have the following classes (+ setters and getters which I omitted for > simplicity) - > > {code:java} > public class MyClass { > private String userId; > private Map<String, MyDTO> data; > private Long offset; > } > public class MyDTO { > private long startTime; > private long endTime; > } > {code} > I collect the result the following way - > {code:java} > Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class); > Dataset<MyClass> results = raw_df.as(myClassEncoder); > List<MyClass> lst = results.collectAsList(); > {code} > > I do several calculations to get the result I want and the result is correct > all the way through before I collect it. > This is the result for - > {code:java} > results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false); > {code} > |data[2017-07-01].startTime|data[2017-07-01].endTime| > +-+--+ > |1498854000|1498870800 | > This is the result after collecting the results for - > {code:java} > MyClass userData = results.collectAsList().get(0); > MyDTO userDTO = userData.getData().get("2017-07-01"); > System.out.println("userDTO startTime: " + userDTO.getStartTime()); > System.out.println("userDTO endTime: " + userDTO.getEndTime()); > {code} > -- > data startTime: 1498870800 > data endTime: 1498854000 > I tend to believe it is a Spark issue. Would love any suggestions on how to > bypass it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21402) Java encoders - switch fields on collectAsList
[ https://issues.apache.org/jira/browse/SPARK-21402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651550#comment-16651550 ] Apache Spark commented on SPARK-21402: -- User 'vofque' has created a pull request for this issue: https://github.com/apache/spark/pull/22745 > Java encoders - switch fields on collectAsList > -- > > Key: SPARK-21402 > URL: https://issues.apache.org/jira/browse/SPARK-21402 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 > Environment: mac os > spark 2.1.1 > Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_121 >Reporter: Tom >Priority: Major > > I have the following schema in a dataset - > root > |-- userId: string (nullable = true) > |-- data: map (nullable = true) > ||-- key: string > ||-- value: struct (valueContainsNull = true) > |||-- startTime: long (nullable = true) > |||-- endTime: long (nullable = true) > |-- offset: long (nullable = true) > And I have the following classes (+ setters and getters which I omitted for > simplicity) - > > {code:java} > public class MyClass { > private String userId; > private Map<String, MyDTO> data; > private Long offset; > } > public class MyDTO { > private long startTime; > private long endTime; > } > {code} > I collect the result the following way - > {code:java} > Encoder<MyClass> myClassEncoder = Encoders.bean(MyClass.class); > Dataset<MyClass> results = raw_df.as(myClassEncoder); > List<MyClass> lst = results.collectAsList(); > {code} > > I do several calculations to get the result I want and the result is correct > all the way through before I collect it. > This is the result for - > {code:java} > results.select(results.col("data").getField("2017-07-01").getField("startTime")).show(false); > {code} > |data[2017-07-01].startTime|data[2017-07-01].endTime| > +-+--+ > |1498854000|1498870800 | > This is the result after collecting the results for - > {code:java} > MyClass userData = results.collectAsList().get(0); > MyDTO userDTO = userData.getData().get("2017-07-01"); > System.out.println("userDTO startTime: " + userDTO.getStartTime()); > System.out.println("userDTO endTime: " + userDTO.getEndTime()); > {code} > -- > data startTime: 1498870800 > data endTime: 1498854000 > I tend to believe it is a Spark issue. Would love any suggestions on how to > bypass it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF
[ https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651565#comment-16651565 ] Yuming Wang commented on SPARK-18832: - [~roadster11x] I can't reproduce. > Spark SQL: Thriftserver unable to run a registered Hive UDTF > > > Key: SPARK-18832 > URL: https://issues.apache.org/jira/browse/SPARK-18832 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2 > Environment: HDP: 2.5 > Spark: 2.0.0 >Reporter: Lokesh Yadav >Priority: Major > Attachments: SampleUDTF.java > > > Spark Thriftserver is unable to run a HiveUDTF. > It throws the error that it is unable to find the functions although the > function registration succeeds and the funtions does show up in the list > output by {{show functions}}. > I am using a Hive UDTF, registering it using a jar placed on my local > machine. Calling it using the following commands: > //Registering the functions, this command succeeds. > {{CREATE FUNCTION SampleUDTF AS > 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR > '/root/spark_files/experiments-1.2.jar';}} > //Thriftserver is able to look up the functuion, on this command: > {{DESCRIBE FUNCTION SampleUDTF;}} > {quote} > {noformat} > Output: > +---+--+ > | function_desc | > +---+--+ > | Function: default.SampleUDTF | > | Class: com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF | > | Usage: N/A. | > +---+--+ > {noformat} > {quote} > // Calling the function: > {{SELECT SampleUDTF('Paris');}} > bq. Output of the above command: Error: > org.apache.spark.sql.AnalysisException: Undefined function: 'SampleUDTF'. > This function is neither a registered temporary function nor a permanent > function registered in the database 'default'.; line 1 pos 7 (state=,code=0) > I have also tried with using a non-local (on hdfs) jar, but I get the same > error. > My environment: HDP 2.5 with spark 2.0.0 > I have attached the class file for the UDTF I am using in testing this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Issue Type: Sub-task (was: Improvement) Parent: SPARK-22386 > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.4.0 > > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateull part of the API, think of better naming > of some interfaces. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Fix Version/s: (was: 2.4.0) 3.0.0 > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Data source V2 is out for a while, see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateull part of the API, think of better naming > of some interfaces. Details please see the attached google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24499: Assignee: Apache Spark > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the page of > https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple > separate pages like what we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651598#comment-16651598 ] Apache Spark commented on SPARK-24499: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/22746 > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the page of > https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple > separate pages like what we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24499: Assignee: (was: Apache Spark) > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the page of > https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple > separate pages like what we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25579) Use quoted attribute names if needed in pushed ORC predicates
[ https://issues.apache.org/jira/browse/SPARK-25579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25579. -- Resolution: Fixed Fix Version/s: 2.4.0 3.0.0 Issue resolved by pull request 22597 [https://github.com/apache/spark/pull/22597] > Use quoted attribute names if needed in pushed ORC predicates > - > > Key: SPARK-25579 > URL: https://issues.apache.org/jira/browse/SPARK-25579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Blocker > Fix For: 3.0.0, 2.4.0 > > > This issue aims to fix an ORC performance regression at Spark 2.4.0 RCs from > Spark 2.3.2. For column names with `.`, the pushed predicates are ignored. > *Test Data* > {code:java} > scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot") > scala> df.write.mode("overwrite").orc("/tmp/orc") > {code} > *Spark 2.3.2* > {code:java} > scala> spark.sql("set spark.sql.orc.impl=native") > scala> spark.sql("set spark.sql.orc.filterPushdown=true") > scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < > 10").show) > ++ > |col.with.dot| > ++ > | 1| > | 8| > ++ > Time taken: 1486 ms > scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < > 10").show) > ++ > |col.with.dot| > ++ > | 1| > | 8| > ++ > Time taken: 163 ms > {code} > *Spark 2.4.0 RC2* > {code:java} > scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < > 10").show) > ++ > |col.with.dot| > ++ > | 1| > | 8| > ++ > Time taken: 4087 ms > scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < > 10").show) > ++ > |col.with.dot| > ++ > | 1| > | 8| > ++ > Time taken: 1998 ms > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
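For readers unfamiliar with the underlying problem, a minimal sketch of the idea follows. This is not Spark's actual OrcFilters code and the helper name is hypothetical: attribute names containing special characters such as '.' need to be back-quoted before being handed to the ORC predicate builder, otherwise the pushed filter is dropped and the scan degrades to reading everything.
{code:scala}
// Hypothetical helper illustrating the fix: quote attribute names that need it.
def quoteIfNeeded(name: String): String =
  if (name.startsWith("`") && name.endsWith("`")) name        // already quoted
  else if (name.exists(c => !c.isLetterOrDigit && c != '_')) s"`$name`"
  else name

assert(quoteIfNeeded("col.with.dot") == "`col.with.dot`")
assert(quoteIfNeeded("plain_col") == "plain_col")
{code}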
[jira] [Created] (SPARK-25745) docker-image-tool.sh ignores errors from Docker
Rob Vesse created SPARK-25745: - Summary: docker-image-tool.sh ignores errors from Docker Key: SPARK-25745 URL: https://issues.apache.org/jira/browse/SPARK-25745 Project: Spark Issue Type: Bug Components: Deploy, Kubernetes Affects Versions: 2.3.2, 2.3.1, 2.3.0 Reporter: Rob Vesse In attempting to use the {{docker-image-tool.sh}} scripts to build some custom Dockerfiles, I ran into issues with the scripts' interaction with Docker. Most notably, if the Docker build/push fails, the script continues blindly, ignoring the errors. This can either result in complete failure to build or lead to subtle bugs where images are built against different base images than expected. Additionally, while the Dockerfiles assume that Spark has first been built locally, the scripts fail to validate this, which they could easily do by checking the expected JARs location. This can also lead to failed Docker builds which could easily be avoided. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24902) Add integration tests for PVs
[ https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651621#comment-16651621 ] Stavros Kontopoulos commented on SPARK-24902: - This is blocked by this issue: https://github.com/fabric8io/kubernetes-client/issues/1234 > Add integration tests for PVs > - > > Key: SPARK-24902 > URL: https://issues.apache.org/jira/browse/SPARK-24902 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > PVs and hostpath support has been added recently > (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. > We should have some integration tests based on local storage. > It is easy to add PVs to minikube and attatch them to pods. > We could target a known dir like /tmp for the PV path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24902) Add integration tests for PVs
[ https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651621#comment-16651621 ] Stavros Kontopoulos edited comment on SPARK-24902 at 10/16/18 12:50 PM: This is blocked by this issue: [https://github.com/fabric8io/kubernetes-client/issues/1234] Target is to use: https://github.com/fabric8io/kubernetes-client/issues/1234 was (Author: skonto): This is blocked by this issue: https://github.com/fabric8io/kubernetes-client/issues/1234 > Add integration tests for PVs > - > > Key: SPARK-24902 > URL: https://issues.apache.org/jira/browse/SPARK-24902 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > PVs and hostpath support has been added recently > (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. > We should have some integration tests based on local storage. > It is easy to add PVs to minikube and attatch them to pods. > We could target a known dir like /tmp for the PV path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24902) Add integration tests for PVs
[ https://issues.apache.org/jira/browse/SPARK-24902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651621#comment-16651621 ] Stavros Kontopoulos edited comment on SPARK-24902 at 10/16/18 12:50 PM: This is blocked by this issue: [https://github.com/fabric8io/kubernetes-client/issues/1234] Target is to use: https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta was (Author: skonto): This is blocked by this issue: [https://github.com/fabric8io/kubernetes-client/issues/1234] Target is to use: https://github.com/fabric8io/kubernetes-client/issues/1234 > Add integration tests for PVs > - > > Key: SPARK-24902 > URL: https://issues.apache.org/jira/browse/SPARK-24902 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Priority: Minor > > PVs and hostpath support has been added recently > (https://github.com/apache/spark/pull/21260/files) for Spark on K8s. > We should have some integration tests based on local storage. > It is easy to add PVs to minikube and attatch them to pods. > We could target a known dir like /tmp for the PV path. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24499) Documentation improvement of Spark core and SQL
[ https://issues.apache.org/jira/browse/SPARK-24499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651628#comment-16651628 ] Yuanjian Li commented on SPARK-24499: - {quote} Is this just about splitting up the docs? {quote} If I understand correctly, splitting is the first step. {quote} let's break that down a little further into JIRAs. {quote} No problem, we'll do the follow-up work by creating sub-tasks for this JIRA. > Documentation improvement of Spark core and SQL > --- > > Key: SPARK-24499 > URL: https://issues.apache.org/jira/browse/SPARK-24499 > Project: Spark > Issue Type: New Feature > Components: Documentation, Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > The current documentation in Apache Spark lacks enough code examples and > tips. If needed, we should also split the page of > https://spark.apache.org/docs/latest/sql-programming-guide.html to multiple > separate pages like what we did for > https://spark.apache.org/docs/latest/ml-guide.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25745) docker-image-tool.sh ignores errors from Docker
[ https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651630#comment-16651630 ] Apache Spark commented on SPARK-25745: -- User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/22748 > docker-image-tool.sh ignores errors from Docker > --- > > Key: SPARK-25745 > URL: https://issues.apache.org/jira/browse/SPARK-25745 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Rob Vesse >Priority: Major > > In attempting to use the {{docker-image-tool.sh}} scripts to build some > custom Dockerfiles I ran into issues with the scripts interaction with > Docker. Most notably if the Docker build/push fails the script continues > blindly ignoring the errors. This can either result in complete failure to > build or lead to subtle bugs where images are built against different base > images than expected. > Additionally while the Dockerfiles assume that Spark is first built locally > the scripts fail to validate this which they could easily do by checking the > expected JARs location. This can also lead to failed Docker builds which > could easily be avoided. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25745) docker-image-tool.sh ignores errors from Docker
[ https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25745: Assignee: (was: Apache Spark) > docker-image-tool.sh ignores errors from Docker > --- > > Key: SPARK-25745 > URL: https://issues.apache.org/jira/browse/SPARK-25745 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Rob Vesse >Priority: Major > > In attempting to use the {{docker-image-tool.sh}} scripts to build some > custom Dockerfiles I ran into issues with the scripts interaction with > Docker. Most notably if the Docker build/push fails the script continues > blindly ignoring the errors. This can either result in complete failure to > build or lead to subtle bugs where images are built against different base > images than expected. > Additionally while the Dockerfiles assume that Spark is first built locally > the scripts fail to validate this which they could easily do by checking the > expected JARs location. This can also lead to failed Docker builds which > could easily be avoided. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25745) docker-image-tool.sh ignores errors from Docker
[ https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25745: Assignee: Apache Spark > docker-image-tool.sh ignores errors from Docker > --- > > Key: SPARK-25745 > URL: https://issues.apache.org/jira/browse/SPARK-25745 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Rob Vesse >Assignee: Apache Spark >Priority: Major > > In attempting to use the {{docker-image-tool.sh}} scripts to build some > custom Dockerfiles I ran into issues with the scripts interaction with > Docker. Most notably if the Docker build/push fails the script continues > blindly ignoring the errors. This can either result in complete failure to > build or lead to subtle bugs where images are built against different base > images than expected. > Additionally while the Dockerfiles assume that Spark is first built locally > the scripts fail to validate this which they could easily do by checking the > expected JARs location. This can also lead to failed Docker builds which > could easily be avoided. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25745) docker-image-tool.sh ignores errors from Docker
[ https://issues.apache.org/jira/browse/SPARK-25745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651631#comment-16651631 ] Apache Spark commented on SPARK-25745: -- User 'rvesse' has created a pull request for this issue: https://github.com/apache/spark/pull/22748 > docker-image-tool.sh ignores errors from Docker > --- > > Key: SPARK-25745 > URL: https://issues.apache.org/jira/browse/SPARK-25745 > Project: Spark > Issue Type: Bug > Components: Deploy, Kubernetes >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Rob Vesse >Priority: Major > > In attempting to use the {{docker-image-tool.sh}} scripts to build some > custom Dockerfiles I ran into issues with the scripts interaction with > Docker. Most notably if the Docker build/push fails the script continues > blindly ignoring the errors. This can either result in complete failure to > build or lead to subtle bugs where images are built against different base > images than expected. > Additionally while the Dockerfiles assume that Spark is first built locally > the scripts fail to validate this which they could easily do by checking the > expected JARs location. This can also lead to failed Docker builds which > could easily be avoided. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651657#comment-16651657 ] Thomas Graves commented on SPARK-25732: --- So, as Marcelo mentioned, can't you re-use the keytab/principal options already there? They might need to be slightly modified to pull from HDFS, but that is really what this is doing; it's just that Livy is submitting the job for you. Really, the user could specify it as a conf when submitting the job (I guess it depends on who is calling Livy; Jupyter, for instance, definitely could, as the user can pass configs). I would prefer that over adding more configs. There are lots of cases where things sit in the middle of job submission: Livy, Oozie, other workflow managers. I don't see that as a reason not to do tokens. Users should know they are submitting jobs (especially ones that run for 2 weeks), and until we have a good automated solution, they would have to set up cron or something else to push tokens before they expire. I know the YARN folks were looking at options to help with this, but I haven't synced with them lately; ideally there would be a way to push the tokens to the RM for it to continue renewing, so you would only have to do it before the max lifetime. It's easy enough to write a script that lists the applications running for the user and pushes tokens to each of those, assuming we had a spark-submit option to push tokens. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25705) Remove Kafka 0.8 support in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25705: - Assignee: Sean Owen > Remove Kafka 0.8 support in Spark 3.0.0 > --- > > Key: SPARK-25705 > URL: https://issues.apache.org/jira/browse/SPARK-25705 > Project: Spark > Issue Type: Task > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 3.0.0 > > > In the move to Spark 3.0, we're talking about removing support for old legacy > connectors and versions of them. Kafka is about 2-3 major versions (depending > on how you count it) on from Kafka 0.8, and should likely be retired. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25705) Remove Kafka 0.8 support in Spark 3.0.0
[ https://issues.apache.org/jira/browse/SPARK-25705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25705. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22703 [https://github.com/apache/spark/pull/22703] > Remove Kafka 0.8 support in Spark 3.0.0 > --- > > Key: SPARK-25705 > URL: https://issues.apache.org/jira/browse/SPARK-25705 > Project: Spark > Issue Type: Task > Components: Build, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Major > Fix For: 3.0.0 > > > In the move to Spark 3.0, we're talking about removing support for old legacy > connectors and versions of them. Kafka is about 2-3 major versions (depending > on how you count it) on from Kafka 0.8, and should likely be retired. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
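For anyone still on the 0.8 DStream connector, the 0.10 integration ({{spark-streaming-kafka-0-10}}) is the replacement. Below is a minimal direct-stream consumer sketch; the broker address, group id, topic name and batch interval are placeholders.
{code:scala}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka010-demo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder connection settings; the 0.10 API takes Kafka consumer configs directly.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "demo-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

// Print the message values of each micro-batch.
stream.map((record: ConsumerRecord[String, String]) => record.value()).print()

ssc.start()
ssc.awaitTermination()
{code}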
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Flags: Important (was: Patch,Important) > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. > > I have tested the provided code in Databricks runtime environment 5.0 and > 4.1, and it is giving the expected output. However in Databricks runtime > 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25739) Double quote coming in as empty value even when emptyValue set as null
[ https://issues.apache.org/jira/browse/SPARK-25739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian Jones updated SPARK-25739: Flags: Patch,Important (was: Important) > Double quote coming in as empty value even when emptyValue set as null > -- > > Key: SPARK-25739 > URL: https://issues.apache.org/jira/browse/SPARK-25739 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 > Environment: Databricks - 4.2 (includes Apache Spark 2.3.1, Scala > 2.11) >Reporter: Brian Jones >Priority: Major > > Example code - > {code:java} > val df = List((1,""),(2,"hello"),(3,"hi"),(4,null)).toDF("key","value") > df > .repartition(1) > .write > .mode("overwrite") > .option("nullValue", null) > .option("emptyValue", null) > .option("delimiter",",") > .option("quoteMode", "NONE") > .option("escape","\\") > .format("csv") > .save("/tmp/nullcsv/") > var out = dbutils.fs.ls("/tmp/nullcsv/") > var file = out(out.size - 1) > val x = dbutils.fs.head("/tmp/nullcsv/" + file.name) > println(x) > {code} > Output - > {code:java} > 1,"" > 3,hi > 2,hello > 4, > {code} > Expected output - > {code:java} > 1, > 3,hi > 2,hello > 4, > {code} > > [https://github.com/apache/spark/commit/b7efca7ece484ee85091b1b50bbc84ad779f9bfe] > This commit is relevant to my issue. > "Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In > version 2.3 and earlier, empty strings are equal to `null` values and do not > reflect to any characters in saved CSV files." > I am on Spark version 2.3.1, so empty strings should be coming as null. Even > then, I am passing the correct "emptyValue" option. However, my empty values > are stilling coming as `""` in the written file. > > I have tested the provided code in Databricks runtime environment 5.0 and > 4.1, and it is giving the expected output. However in Databricks runtime > 4.2 and 4.3 (which are running spark 2.3.1) we get the incorrect output. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
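For anyone trying to reproduce this outside Databricks, here is a sketch of the same write using only open-source APIs; the output path and local session settings are arbitrary, and {{dbutils.fs.head}} is replaced with a plain text read.
{code:scala}
import org.apache.spark.sql.SparkSession

// Standalone reproduction sketch of the reporter's code; /tmp/nullcsv is an arbitrary path.
val spark = SparkSession.builder().master("local[*]").appName("emptyvalue-repro").getOrCreate()
import spark.implicits._

val df = List((1, ""), (2, "hello"), (3, "hi"), (4, null)).toDF("key", "value")

df.repartition(1)
  .write
  .mode("overwrite")
  .option("nullValue", null)
  .option("emptyValue", null)
  .option("delimiter", ",")
  .option("quoteMode", "NONE")
  .option("escape", "\\")
  .format("csv")
  .save("/tmp/nullcsv/")

// Inspect the written file without dbutils: read it back as plain text.
spark.read.text("/tmp/nullcsv/").show(truncate = false)
{code}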
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651800#comment-16651800 ] Marco Gaido commented on SPARK-25732: - [~tgraves] I think they can be reused; the point is that it may be confusing that: {code} kinit -kt super.keytab su...@example.com spark-submit --principal a...@example.com --keytab hdfs:///a.keytab --proxy-user a {code} runs with user {{super}} impersonating user {{a}}, while {code} kinit -kt super.keytab su...@example.com spark-submit --principal a...@example.com --keytab hdfs:///a.keytab {code} runs with user {{a}}. So the reason I was proposing different configs is clarity for the end user. I think the other point is that giving external systems the responsibility of pushing tokens can cause any number of issues, and it is going to be hard to understand where the responsibility lies. Centralizing the responsibility in Spark would allow all these intermediate systems to work properly without any issue. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651827#comment-16651827 ] Thomas Graves commented on SPARK-25732: --- Yeah, I understand the concern, we don't want to confuse users if we can help it. The two commands above are the same except for the --proxy-user, and in my opinion it would not be confusing to the user: you pass in the keytab and principal for the user whose credentials you want refreshed, which is user "a" in both cases. Making sure the docs are clear should make it clear to the users. I assume most users submitting via Livy don't realize they are using Livy and being launched as a proxy user. So the user would just specify keytab/principal configs based on their own user. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651840#comment-16651840 ] Thomas Graves commented on SPARK-25732: --- sorry just realized I misread the second one. why would you run the second command? I would actually expect that to fail or run as the super user unless it downloaded the keytab and kinit'd on submission before it did anything with hdfs, etc. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651840#comment-16651840 ] Thomas Graves edited comment on SPARK-25732 at 10/16/18 2:49 PM: - sorry just realized I misread the second one, though it was kinit as user a. why would you run the second command? I would actually expect that to fail or run as the super user unless it downloaded the keytab and kinit'd on submission before it did anything with hdfs, etc. was (Author: tgraves): sorry just realized I misread the second one. why would you run the second command? I would actually expect that to fail or run as the super user unless it downloaded the keytab and kinit'd on submission before it did anything with hdfs, etc. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651840#comment-16651840 ] Thomas Graves edited comment on SPARK-25732 at 10/16/18 2:53 PM: - sorry just realized I misread the second one, thought it was kinit as user a. why would you run the second command? I would actually expect that to fail or run as the super user unless it downloaded the keytab and kinit'd on submission before it did anything with hdfs, etc. I guess that is the confusion you were referring to and can see that but it seems like an odd use case to me. Is something submitting this way now? It almost seems like something we should disallow. was (Author: tgraves): sorry just realized I misread the second one, though it was kinit as user a. why would you run the second command? I would actually expect that to fail or run as the super user unless it downloaded the keytab and kinit'd on submission before it did anything with hdfs, etc. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651860#comment-16651860 ] Marco Gaido commented on SPARK-25732: - [~tgraves] yes, exactly, that is what I am referring to as confusing. The point is that, currently, if we specify --principal and --keytab they are used to log in (please see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L348). So those are the credentials used (regardless of what, if anything, you are kinit'ed as). > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
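As a point of reference for the discussion above, the behaviour being described boils down to the Hadoop UGI calls sketched below. This is a simplified illustration, not SparkSubmit's actual code; the principal, keytab path and user name are placeholders.
{code:scala}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Placeholders for illustration only.
val principal = "a@EXAMPLE.COM"
val keytab = "/path/to/a.keytab"
val proxyUser = "a"

// --principal / --keytab: this login decides which Kerberos identity is used,
// regardless of what the submitting shell is currently kinit'ed as.
val loginUgi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)

// --proxy-user: the application logic then runs as the impersonated user.
val proxyUgi = UserGroupInformation.createProxyUser(proxyUser, loginUgi)

proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // submit the application / obtain and renew delegation tokens here
  }
})
{code}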
[jira] [Created] (SPARK-25746) Refactoring ExpressionEncoder
Liang-Chi Hsieh created SPARK-25746: --- Summary: Refactoring ExpressionEncoder Key: SPARK-25746 URL: https://issues.apache.org/jira/browse/SPARK-25746 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh This was inspired while implementing SPARK-24762. Currently, ScalaReflection needs to consider how ExpressionEncoder uses the generated serializers and deserializers, and ExpressionEncoder has a weird flat flag. After discussion with [~cloud_fan], it seems better to refactor ExpressionEncoder. It should make SPARK-24762 easier to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
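For context, a minimal example of the public surface that ExpressionEncoder backs (the {{Point}} case class and local session are made up for illustration): the Dataset API derives an ExpressionEncoder-based encoder for case classes via {{Encoders.product}} or the implicits, which is the machinery being refactored here.
{code:scala}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

case class Point(x: Double, y: Double)

val spark = SparkSession.builder().master("local[*]").appName("encoder-demo").getOrCreate()

// Encoders.product builds the encoder that serializes a case class to Spark's
// internal row format and back; internally it is an ExpressionEncoder.
val pointEncoder: Encoder[Point] = Encoders.product[Point]

val ds = spark.createDataset(Seq(Point(1.0, 2.0), Point(3.0, 4.0)))(pointEncoder)
ds.show()
{code}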
[jira] [Assigned] (SPARK-25746) Refactoring ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-25746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25746: Assignee: (was: Apache Spark) > Refactoring ExpressionEncoder > - > > Key: SPARK-25746 > URL: https://issues.apache.org/jira/browse/SPARK-25746 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > This is inspired during implementing SPARK-24762. For now ScalaReflection > needs to consider how ExpressionEncoder uses generated serializers and > deserializers. And ExpressionEncoder has a weird flat flag. After discussion > with [~cloud_fan], it seems to be better to refactor ExpressionEncoder. It > should make SPARK-24762 easier to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25746) Refactoring ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-25746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16651927#comment-16651927 ] Apache Spark commented on SPARK-25746: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/22749 > Refactoring ExpressionEncoder > - > > Key: SPARK-25746 > URL: https://issues.apache.org/jira/browse/SPARK-25746 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > This is inspired during implementing SPARK-24762. For now ScalaReflection > needs to consider how ExpressionEncoder uses generated serializers and > deserializers. And ExpressionEncoder has a weird flat flag. After discussion > with [~cloud_fan], it seems to be better to refactor ExpressionEncoder. It > should make SPARK-24762 easier to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25746) Refactoring ExpressionEncoder
[ https://issues.apache.org/jira/browse/SPARK-25746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25746: Assignee: Apache Spark > Refactoring ExpressionEncoder > - > > Key: SPARK-25746 > URL: https://issues.apache.org/jira/browse/SPARK-25746 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > This is inspired during implementing SPARK-24762. For now ScalaReflection > needs to consider how ExpressionEncoder uses generated serializers and > deserializers. And ExpressionEncoder has a weird flat flag. After discussion > with [~cloud_fan], it seems to be better to refactor ExpressionEncoder. It > should make SPARK-24762 easier to do. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion
Wenchen Fan created SPARK-25747: --- Summary: remove ColumnarBatchScan.needsUnsafeRowConversion Key: SPARK-25747 URL: https://issues.apache.org/jira/browse/SPARK-25747 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion
[ https://issues.apache.org/jira/browse/SPARK-25747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25747: Assignee: Apache Spark (was: Wenchen Fan) > remove ColumnarBatchScan.needsUnsafeRowConversion > - > > Key: SPARK-25747 > URL: https://issues.apache.org/jira/browse/SPARK-25747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion
[ https://issues.apache.org/jira/browse/SPARK-25747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652007#comment-16652007 ] Apache Spark commented on SPARK-25747: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/22750 > remove ColumnarBatchScan.needsUnsafeRowConversion > - > > Key: SPARK-25747 > URL: https://issues.apache.org/jira/browse/SPARK-25747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25747) remove ColumnarBatchScan.needsUnsafeRowConversion
[ https://issues.apache.org/jira/browse/SPARK-25747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25747: Assignee: Wenchen Fan (was: Apache Spark) > remove ColumnarBatchScan.needsUnsafeRowConversion > - > > Key: SPARK-25747 > URL: https://issues.apache.org/jira/browse/SPARK-25747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25732) Allow specifying a keytab/principal for proxy user for token renewal
[ https://issues.apache.org/jira/browse/SPARK-25732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652029#comment-16652029 ] Marcelo Vanzin commented on SPARK-25732: bq. livy server would not know which data sources to fetch tokens for That is true. Two approaches that currently exist are Spark's (get tokens for everything it can, even if they won't be used) and Oozie's (I believe; make the user explicitly choose at submission time which tokens the app will need). bq. So the reason why I was proposing different configs is for clarity of the end user. Why would the user care how the Spark application is started by a service? And in the second case you would not need the kinit for the service account. bq. I think the other point is that giving to the external systems the responsibility of pushing tokens can cause an indefinite number of issues It also solves two issues with the other approach: not everybody has a keytab, and those who do generally dislike their keytabs being sent around the network and stored in a bunch of places. > Allow specifying a keytab/principal for proxy user for token renewal > - > > Key: SPARK-25732 > URL: https://issues.apache.org/jira/browse/SPARK-25732 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Priority: Major > > As of now, application submitted with proxy-user fail after 2 week due to the > lack of token renewal. In order to enable it, we need the the > keytab/principal of the impersonated user to be specified, in order to have > them available for the token renewal. > This JIRA proposes to add two parameters {{--proxy-user-principal}} and > {{--proxy-user-keytab}}, and the last letting a keytab being specified also > in a distributed FS, so that applications can be submitted by servers (eg. > Livy, Zeppelin) without needing all users' principals being on that machine. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25643) Performance issues querying wide rows
[ https://issues.apache.org/jira/browse/SPARK-25643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652039#comment-16652039 ] Ruslan Dautkhanov commented on SPARK-25643: --- [~viirya] we can confirm this problem on our production workloads too. Realizing wide tables backed by columnar formats is super expensive. In the comments of SPARK-25164 you can see that *even a simple query fetching 70k rows takes 20 minutes* on a table with 10m records. It would be great if Spark had an optimization to realize only the columns required by the `where` clause first, and realize the rest of the columns only after filtering; it seems this would fix the huge performance overhead on wide datasets. A key piece of [~bersprockets]'s findings is {quote}According to initial profiling, it appears that most time is spent realizing the entire row in the scan, just so the filter can look at a tiny subset of columns and almost certainly throw the row away .. The profiling shows 74% of time is spent in FileSourceScanExec{quote} > Performance issues querying wide rows > - > > Key: SPARK-25643 > URL: https://issues.apache.org/jira/browse/SPARK-25643 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Major > > Querying a small subset of rows from a wide table (e.g., a table with 6000 > columns) can be quite slow in the following case: > * the table has many rows (most of which will be filtered out) > * the projection includes every column of a wide table (i.e., select *) > * predicate push down is not helping: either matching rows are sprinkled > fairly evenly throughout the table, or predicate push down is switched off > Even if the filter involves only a single column and the returned result > includes just a few rows, the query can run much longer compared to an > equivalent query against a similar table with fewer columns. > According to initial profiling, it appears that most time is spent realizing > the entire row in the scan, just so the filter can look at a tiny subset of > columns and almost certainly throw the row away. The profiling shows 74% of > time is spent in FileSourceScanExec, and that time is spent across numerous > writeFields_0_xxx method calls. > If Spark must realize the entire row just to check a tiny subset of columns, > this all sounds reasonable. However, I wonder if there is an optimization > here where we can avoid realizing the entire row until after the filter has > selected the row. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
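A rough, self-contained way to see the effect being described (the column count, row count and /tmp path are made up, and absolute timings will vary by machine): filter a very wide Parquet table on one column with a full {{select *}} versus a narrow projection.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("wide-rows-demo").getOrCreate()

// Build a wide table: 1000 columns derived from a single id column.
val nCols = 1000
val extraCols = (1 until nCols).map(i => (col("id") + i).as(s"c$i"))
spark.range(200000L)
  .select((col("id") +: extraCols): _*)
  .write.mode("overwrite").parquet("/tmp/wide_table")

// select * : every column of each scanned row is realized before the filter
// looks at the single column it needs, so most of that work is thrown away.
spark.time(spark.read.parquet("/tmp/wide_table").where("id = 42").show())

// Narrow projection: only the referenced columns are read and realized.
spark.time(spark.read.parquet("/tmp/wide_table").where("id = 42").select("id", "c1").show())
{code}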
[jira] [Commented] (SPARK-25742) Is there a way to pass the Azure blob storage credentials to the spark for k8s init-container?
[ https://issues.apache.org/jira/browse/SPARK-25742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652064#comment-16652064 ] Yinan Li commented on SPARK-25742: -- The k8s secrets you add through the {{spark.kubernetes.driver.secrets.}} config option will also get mounted into the init-container in the driver pod. You can use that to pass the credentials for pulling dependencies into the driver init-container. > Is there a way to pass the Azure blob storage credentials to the spark for > k8s init-container? > -- > > Key: SPARK-25742 > URL: https://issues.apache.org/jira/browse/SPARK-25742 > Project: Spark > Issue Type: Question > Components: Kubernetes >Affects Versions: 2.3.2 >Reporter: Oscar Bonilla >Priority: Minor > > I'm trying to run spark on a kubernetes cluster in Azure. The idea is to > store the Spark application jars and dependencies in a container in Azure > Blob Storage. > I've tried to do this with a public container and this works OK, but when > having a private Blob Storage container, the spark-init init container > doesn't download the jars. > The equivalent in AWS S3 is as simple as adding the key_id and secret as > environment variables, but I don't see how to do this for Azure Blob Storage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
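As a sketch of what this looks like in practice (the secret name {{azure-blob-creds}} and the mount path are hypothetical; you would create the Kubernetes secret yourself and have your code read the credentials from the mounted files):
{code:scala}
import org.apache.spark.SparkConf

// Mount the named Kubernetes secret into the driver pod (and, in 2.3.x, its
// init-container) and into the executors; secret name and path are placeholders.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.secrets.azure-blob-creds", "/mnt/secrets/azure")
  .set("spark.kubernetes.executor.secrets.azure-blob-creds", "/mnt/secrets/azure")
{code}
In practice these settings would typically be passed as {{--conf}} options to spark-submit rather than set programmatically.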
[jira] [Created] (SPARK-25748) Upgrade net.razorvine:pyrolite version
Ankur Gupta created SPARK-25748: --- Summary: Upgrade net.razorvine:pyrolite version Key: SPARK-25748 URL: https://issues.apache.org/jira/browse/SPARK-25748 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Ankur Gupta The following high-severity security vulnerability was reported: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-1100. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17875) Remove unneeded direct dependence on Netty 3.x
[ https://issues.apache.org/jira/browse/SPARK-17875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-17875: --- > Remove unneeded direct dependence on Netty 3.x > -- > > Key: SPARK-17875 > URL: https://issues.apache.org/jira/browse/SPARK-17875 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.0.1 >Reporter: Sean Owen >Priority: Trivial > > The Spark build declares a dependency on Netty 3.x and 4.x, but only 4.x is > used. It's best to remove the 3.x dependency (and while we're at it, update a > few things like license info) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25748) Upgrade net.razorvine:pyrolite version
[ https://issues.apache.org/jira/browse/SPARK-25748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652109#comment-16652109 ] Marcelo Vanzin commented on SPARK-25748: That CVE seems completely unrelated to pyrolite? Or am I missing something here? > Upgrade net.razorvine:pyrolite version > -- > > Key: SPARK-25748 > URL: https://issues.apache.org/jira/browse/SPARK-25748 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Ankur Gupta >Priority: Major > > The following high security vulnerability was discovered: > https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-1100. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25749) Exception thrown while reading avro file with large schema
Raj created SPARK-25749: --- Summary: Exception thrown while reading avro file with large schema Key: SPARK-25749 URL: https://issues.apache.org/jira/browse/SPARK-25749 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2, 2.3.1, 2.3.0 Reporter: Raj Hi, We are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the job reads avro source that has large nested schema. The job fails for Spark 2.3.1(Have tested in Spark 2.3.0 & Spark 2.3.2 and the job fails in this case also). I am able to replicate this with some sample data. Please find below the code, build file & exception log *Code (EncoderExample.scala)* package com.rj.enc import com.rj.logger.CustomLogger import org.apache.log4j.Logger import com.rj.sc.SparkUtil import org.apache.spark.sql.catalyst.ScalaReflection import org.apache.spark.sql.types.StructType import org.apache.spark.sql.Encoders object EncoderExample { val log: Logger = CustomLogger.getLogger(this.getClass.getName.dropRight(1)) val user = "xxx" val sourcePath = s"file:///Users/$user/del/avrodata" val resultPath = s"file:///Users/$user/del/pqdata" def main(args: Array[String]): Unit = { writeData() // Create sample data readData() // Read, Process & write back the results (App fails in this method for spark 2.3.1) } def readData(): Unit = { log.info("sourcePath -> " + sourcePath) val ss = SparkUtil.getSparkSession(this.getClass.getName) val schema = ScalaReflection.schemaFor[MainCC].dataType.asInstanceOf[StructType] import com.databricks.spark.avro._ import ss.implicits._ val ds = ss.sqlContext.read.schema(schema).option("basePath", sourcePath). avro(this.sourcePath).as[MainCC] log.info("Schema -> " + ds.schema.treeString) log.info("Count x -> " + ds.count) val encr = Encoders.product[ResultCC] val res = ds.map{ x => val es: Long = x.header.tamp ResultCC(es = es) }(encr) res.write.parquet(this.resultPath) } def writeData(): Unit = { val ss = SparkUtil.getSparkSession(this.getClass.getName) import ss.implicits._ val ds = ss.sparkContext.parallelize(Seq(MainCC(), MainCC())).toDF//.as[MainCC] log.info("source count 5 -> " + ds.count) import com.databricks.spark.avro._ ds.write.avro(this.sourcePath) log.info("Written") } } final case class ResultCC( es: Long) *Case Class (Schema of source avro data)* package com.rj.enc case class Header(tamp: Long = 12, xy: Option[String] = Some("aaa")) case class Key(hi: Option[String] = Some("aaa")) case class L30 ( l1: Option[Double] = Some(123d) ,l2: Option[Double] = Some(123d) ,l3: Option[String] = Some("aaa") ,l4: Option[String] = Some("aaa") ,l5: Option[String] = Some("aaa") ,l6: Option[String] = Some("aaa") ,l7: Option[String] = Some("aaa") ) case class C45 ( r1: Option[String] = Some("aaa") ,r2: Option[String] = Some("aaa") ) case class B45 ( e1: Option[String] = Some("aaa") ,e2: Option[Int] = Some(123) ,e3: Option[String] = Some("aaa") ) case class D45 (`t1`: Option[String] = Some("aaa")) case class M30 ( b1: Option[B45] = Some(B45()) ,b2: Option[C45] = Some(C45()) ,b3: Option[D45] = Some(D45()) ) case class Y50 ( g1: Option[String] = Some("aaa") ,g2: Option[String] = Some("aaa") ,g3: Option[String] = Some("aaa") ,g4: Option[String] = Some("aaa") ,g5: Option[String] = Some("aaa") ,g6: Option[String] = Some("aaa") ) case class X50 ( c1: Option[String] = Some("aaa") ,c2: Option[String] = Some("aaa") ,c3: Option[String] = Some("aaa") ,c4: Option[String] = Some("aaa") ) case class L10 ( u1: Option[String] = Some("aaa") ,u2: Option[String] = Some("aaa") ,u3: Option[String] = Some("aaa") ,u4: Option[String] = 
Some("aaa") ,u5: Option[Y50] = Some(Y50()) ,u6: Option[X50] = Some(X50()) ,u7: Option[String] = Some("a
[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema
[ https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raj updated SPARK-25749: Attachment: MainCC.scala EncoderExample.scala > Exception thrown while reading avro file with large schema > -- > > Key: SPARK-25749 > URL: https://issues.apache.org/jira/browse/SPARK-25749 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Raj >Priority: Blocker > Attachments: EncoderExample.scala, MainCC.scala >
[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema
[ https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raj updated SPARK-25749: Attachment: build.sbt > Exception thrown while reading avro file with large schema > -- > > Key: SPARK-25749 > URL: https://issues.apache.org/jira/browse/SPARK-25749 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Raj >Priority: Blocker > Attachments: EncoderExample.scala, MainCC.scala, build.sbt >
[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema
[ https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raj updated SPARK-25749: Attachment: exception > Exception thrown while reading avro file with large schema > -- > > Key: SPARK-25749 > URL: https://issues.apache.org/jira/browse/SPARK-25749 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2 >Reporter: Raj >Priority: Blocker > Attachments: EncoderExample.scala, MainCC.scala, build.sbt, exception >
[jira] [Updated] (SPARK-25749) Exception thrown while reading avro file with large schema
[ https://issues.apache.org/jira/browse/SPARK-25749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raj updated SPARK-25749: Description: Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs reads an Avro source that has a large nested schema. The job fails on Spark 2.3.1 (I have also tested Spark 2.3.0 and Spark 2.3.2, and the job fails there as well). I am able to replicate this with some sample data and a dummy case class. Please find attached the *Code*: EncoderExample.scala, MainCC.scala & build.sbt, and the *Exception log*: exception. PS: I am getting the exception {{java.lang.OutOfMemoryError: Java heap space}}. I have tried increasing the JVM heap size in Eclipse, but that does not help either. I have also tested the code on Spark 2.2.2 and it works fine, so this bug seems to have been introduced in Spark 2.3.0. was: Hi, we are migrating our jobs from Spark 2.2.0 to Spark 2.3.1. One of the jobs reads an Avro source that has a large nested schema. The job fails on Spark 2.3.1 (I have also tested Spark 2.3.0 and Spark 2.3.2, and the job fails there as well). I am able to replicate this with some sample data.
[jira] [Resolved] (SPARK-25748) Upgrade net.razorvine:pyrolite version
[ https://issues.apache.org/jira/browse/SPARK-25748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Gupta resolved SPARK-25748. - Resolution: Invalid Yeah, it seems this jar was tagged incorrectly in the internal security scan. Closing the jira. > Upgrade net.razorvine:pyrolite version > -- > > Key: SPARK-25748 > URL: https://issues.apache.org/jira/browse/SPARK-25748 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Ankur Gupta >Priority: Major > > The following high security vulnerability was discovered: > https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-1100. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20327) Add CLI support for YARN custom resources, like GPUs
[ https://issues.apache.org/jira/browse/SPARK-20327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652393#comment-16652393 ] Apache Spark commented on SPARK-20327: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/22751 > Add CLI support for YARN custom resources, like GPUs > > > Key: SPARK-20327 > URL: https://issues.apache.org/jira/browse/SPARK-20327 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Daniel Templeton >Assignee: Szilard Nemeth >Priority: Major > Labels: newbie > Fix For: 3.0.0 > > > YARN-3926 adds the ability for administrators to configure custom resources, > like GPUs. This JIRA is to add support to Spark for requesting resources > other than CPU virtual cores and memory. See YARN-3926. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25750) Secure HDFS Integration Testing
Ilan Filonenko created SPARK-25750: -- Summary: Secure HDFS Integration Testing Key: SPARK-25750 URL: https://issues.apache.org/jira/browse/SPARK-25750 Project: Spark Issue Type: Test Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko Integration testing for Secure HDFS interaction for Spark on Kubernetes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25750) Secure HDFS Integration Testing
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25750: Assignee: Apache Spark > Secure HDFS Integration Testing > --- > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Assignee: Apache Spark >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25750) Secure HDFS Integration Testing
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652400#comment-16652400 ] Apache Spark commented on SPARK-25750: -- User 'ifilonenko' has created a pull request for this issue: https://github.com/apache/spark/pull/22608 > Secure HDFS Integration Testing > --- > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25750) Secure HDFS Integration Testing
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25750: Assignee: (was: Apache Spark) > Secure HDFS Integration Testing > --- > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652440#comment-16652440 ] t oo commented on SPARK-20202: -- bump > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Major > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25751) Unit Testing for Kerberos Support for Spark on Kubernetes
Ilan Filonenko created SPARK-25751: -- Summary: Unit Testing for Kerberos Support for Spark on Kubernetes Key: SPARK-25751 URL: https://issues.apache.org/jira/browse/SPARK-25751 Project: Spark Issue Type: Test Components: Kubernetes Affects Versions: 3.0.0 Reporter: Ilan Filonenko Unit tests for Kerberos Support within Spark on Kubernetes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652449#comment-16652449 ] t oo commented on SPARK-18673: -- bump > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
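For readers unfamiliar with the failure mode: the Hive 1.2 shim layer maps the detected Hadoop major version to a shim class and rejects anything it does not recognize. The snippet below is an illustrative sketch of that kind of gate, not the actual org.apache.hadoop.hive.shims.ShimLoader code; class names and message details differ in the real implementation.
{code}
// Illustrative sketch only: the real logic lives in Hive's ShimLoader.
object ShimVersionGateSketch {
  private val shims = Map("0.20S" -> "Hadoop20SShims", "0.23" -> "Hadoop23Shims")

  def shimFor(hadoopVersion: String): String = {
    val majorKey = hadoopVersion.split("\\.") match {
      case Array("1", _*) => "0.20S"
      case Array("2", _*) => "0.23"
      case _              => null // Hadoop 3.x lands here...
    }
    // ...and an unknown key surfaces as the "unrecognized Hadoop version" error Spark hits.
    shims.getOrElse(majorKey,
      throw new IllegalArgumentException(s"Unrecognized Hadoop major version: $hadoopVersion"))
  }
}
{code}
Because Spark bundles its own hive 1.2.x JAR, this check is baked in until that bundled Hive is updated, which is the point of the issue above.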
[jira] [Updated] (SPARK-25750) Integration Testing for Kerberos Support for Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-25750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilan Filonenko updated SPARK-25750: --- Summary: Integration Testing for Kerberos Support for Spark on Kubernetes (was: Secure HDFS Integration Testing) > Integration Testing for Kerberos Support for Spark on Kubernetes > > > Key: SPARK-25750 > URL: https://issues.apache.org/jira/browse/SPARK-25750 > Project: Spark > Issue Type: Test > Components: Kubernetes >Affects Versions: 3.0.0 >Reporter: Ilan Filonenko >Priority: Major > > Integration testing for Secure HDFS interaction for Spark on Kubernetes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25752) Add trait to easily whitelist logical operators that produce named output from CleanupAliases
Tathagata Das created SPARK-25752: - Summary: Add trait to easily whitelist logical operators that produce named output from CleanupAliases Key: SPARK-25752 URL: https://issues.apache.org/jira/browse/SPARK-25752 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Tathagata Das Assignee: Tathagata Das The rule `CleanupAliases` removes aliases from logical operators that do not match a whitelist. This whitelist is hardcoded inside the rule, which is cumbersome. This PR cleans that up by introducing a trait, `HasNamedOutput`, that is ignored by `CleanupAliases`; operators that require their aliases to be preserved should extend it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
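A minimal sketch of the intended pattern, using simplified stand-ins rather than the real Catalyst types (the actual `LogicalPlan`, `CleanupAliases`, and trait live in Spark's analyzer and differ in detail): operators opt out of alias cleanup by mixing in a marker trait instead of being enumerated in a hardcoded whitelist inside the rule.
{code}
// Simplified stand-ins for Catalyst classes; illustration only.
trait LogicalPlan { def children: Seq[LogicalPlan] }

// Marker trait: operators whose output names/aliases must be preserved extend this
// instead of being listed inside the cleanup rule itself.
trait HasNamedOutput { self: LogicalPlan => }

case class Project(exprs: Seq[String], child: LogicalPlan) extends LogicalPlan {
  override def children: Seq[LogicalPlan] = Seq(child)
}

case class Aggregate(groupings: Seq[String], aggregates: Seq[String], child: LogicalPlan)
  extends LogicalPlan with HasNamedOutput {
  override def children: Seq[LogicalPlan] = Seq(child)
}

object CleanupAliasesSketch {
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case _: HasNamedOutput => plan           // skip: this operator keeps its aliases
    case other             => stripAliases(other)
  }

  // Placeholder for the actual alias-stripping transformation.
  private def stripAliases(plan: LogicalPlan): LogicalPlan = plan
}
{code}
With this shape, adding a new alias-preserving operator only requires extending the trait, not editing the rule.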
[jira] [Assigned] (SPARK-25394) Expose App status metrics as Source
[ https://issues.apache.org/jira/browse/SPARK-25394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-25394: -- Assignee: Stavros Kontopoulos > Expose App status metrics as Source > --- > > Key: SPARK-25394 > URL: https://issues.apache.org/jira/browse/SPARK-25394 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > ApplicationListener in Spark core captures useful metrics like job duration > which are exposed via the spark rest api (available both at the driver's ui > and the history server ui) only. From a devops perspective, and especially on > k8s where metrics scraping is often done via the prometheus jmx exporter, it > would be good to have all metrics in one place, avoiding scraping the metrics > rest api. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25394) Expose App status metrics as Source
[ https://issues.apache.org/jira/browse/SPARK-25394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25394. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22381 [https://github.com/apache/spark/pull/22381] > Expose App status metrics as Source > --- > > Key: SPARK-25394 > URL: https://issues.apache.org/jira/browse/SPARK-25394 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Stavros Kontopoulos >Assignee: Stavros Kontopoulos >Priority: Major > Fix For: 3.0.0 > > > ApplicationListener in Spark core captures useful metrics like job duration > which are exposed via the spark rest api (available both at the driver's ui > and the history server ui) only. From a devops perspective, and especially on > k8s where metrics scraping is often done via the prometheus jmx exporter, it > would be good to have all metrics in one place, avoiding scraping the metrics > rest api. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
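For context on what "expose as Source" means in practice, here is a minimal sketch in the shape of a Dropwizard metrics source; the class name, gauge names, and the statusStore function are hypothetical, and the real sources added by this change live in Spark's status store code and differ in detail.
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical app-status source: publishes listener-derived values as gauges so any
// configured sink (JMX, Prometheus JMX exporter, ...) can scrape them alongside other metrics.
class AppStatusMetricsSourceSketch(statusStore: () => (Long, Long)) {
  val sourceName: String = "appStatus"
  val metricRegistry: MetricRegistry = new MetricRegistry

  // statusStore() is assumed to return (completedJobs, totalJobDurationMs).
  metricRegistry.register(MetricRegistry.name("jobs", "completedCount"), new Gauge[Long] {
    override def getValue: Long = statusStore()._1
  })
  metricRegistry.register(MetricRegistry.name("jobs", "totalDurationMs"), new Gauge[Long] {
    override def getValue: Long = statusStore()._2
  })
}
{code}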
[jira] [Resolved] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec
[ https://issues.apache.org/jira/browse/SPARK-25631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25631. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22670 [https://github.com/apache/spark/pull/22670] > KafkaRDDSuite: basic usage2 min 4 sec > --- > > Key: SPARK-25631 > URL: https://issues.apache.org/jira/browse/SPARK-25631 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage > Took 2 min 4 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25631) KafkaRDDSuite: basic usage 2 min 4 sec
[ https://issues.apache.org/jira/browse/SPARK-25631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25631: - Assignee: Dilip Biswal > KafkaRDDSuite: basic usage2 min 4 sec > --- > > Key: SPARK-25631 > URL: https://issues.apache.org/jira/browse/SPARK-25631 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.streaming.kafka010.KafkaRDDSuite.basic usage > Took 2 min 4 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.
[ https://issues.apache.org/jira/browse/SPARK-25632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25632. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22670 [https://github.com/apache/spark/pull/22670] > KafkaRDDSuite: compacted topic 2 min 5 sec. > --- > > Key: SPARK-25632 > URL: https://issues.apache.org/jira/browse/SPARK-25632 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic > Took 2 min 5 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25632) KafkaRDDSuite: compacted topic 2 min 5 sec.
[ https://issues.apache.org/jira/browse/SPARK-25632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-25632: - Assignee: Dilip Biswal > KafkaRDDSuite: compacted topic 2 min 5 sec. > --- > > Key: SPARK-25632 > URL: https://issues.apache.org/jira/browse/SPARK-25632 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Dilip Biswal >Priority: Major > Fix For: 3.0.0 > > > org.apache.spark.streaming.kafka010.KafkaRDDSuite.compacted topic > Took 2 min 5 sec. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25693) Fix the multiple "manager" classes in org.apache.spark.deploy.security
[ https://issues.apache.org/jira/browse/SPARK-25693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-25693. Resolution: Duplicate I started looking at this while waiting for reviews of SPARK-23781, and it will be less noisy if I do both together. It ends up being not that much larger than SPARK-23781 in isolation. > Fix the multiple "manager" classes in org.apache.spark.deploy.security > -- > > Key: SPARK-25693 > URL: https://issues.apache.org/jira/browse/SPARK-25693 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > In SPARK-23781 I've done some refactoring which introduces an > {{AbstractCredentialManager}} class. That name sort of clashes with the > existing {{HadoopDelegationTokenManager}}. > Since the latter doesn't really manage anything, it just orchestrates the > fetching of the tokens, we could rename it to something more generic. Or even > clean up some of that class hierarchy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652687#comment-16652687 ] Devaraj K commented on SPARK-24787: --- It seems the overhead here comes from the FileChannel.force call in the DataNode, which is part of hsync and writes the data to the storage device. hsync also does not make much difference with or without the SyncFlag.UPDATE_LENGTH flag, probably because updating the length is a simple call to the NameNode. I think the hsync change can be reverted: the history server can get the latest file length using DFSInputStream.getFileLength(), which includes lastBlockBeingWrittenLength. If the cached length is the same as FileStatus.getLen(), the history server can make an additional call to DFSInputStream.getFileLength() to get the latest length and decide whether to update the history log or not. > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > inprogress files allowing history server being responsive. > Although we have a production job that has 6 tasks per stage and due to > hsync being slow it starts dropping events and the history server has wrong > stats due to events being dropped. > A viable solution is not to make it sync very frequently or make it > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
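A rough sketch of the length check proposed above, assuming the history server keeps a cached length per in-progress log; the object and method names are made up, casting the wrapped stream to DFSInputStream is the assumed way to reach getFileLength(), and error handling is omitted.
{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.DFSInputStream

object InProgressLogLength {
  // Returns the freshest known length of an in-progress event log. If the NameNode-reported
  // length (FileStatus.getLen) has not moved past what was already processed, ask the open
  // stream for the length that includes the block currently being written.
  def latest(fs: FileSystem, log: Path, cachedLen: Long): Long = {
    val reportedLen = fs.getFileStatus(log).getLen
    if (reportedLen > cachedLen) {
      reportedLen
    } else {
      val in = fs.open(log)
      try {
        in.getWrappedStream match {
          case dfs: DFSInputStream => dfs.getFileLength  // includes lastBlockBeingWrittenLength
          case _                   => reportedLen        // non-HDFS file systems
        }
      } finally {
        in.close()
      }
    }
  }
}
{code}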
[jira] [Assigned] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24787: Assignee: Apache Spark > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Assignee: Apache Spark >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > inprogress files allowing history server being responsive. > Although we have a production job that has 6 tasks per stage and due to > hsync being slow it starts dropping events and the history server has wrong > stats due to events being dropped. > A viable solution is not to make it sync very frequently or make it > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24787: Assignee: (was: Apache Spark) > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > inprogress files allowing history server being responsive. > Although we have a production job that has 6 tasks per stage and due to > hsync being slow it starts dropping events and the history server has wrong > stats due to events being dropped. > A viable solution is not to make it sync very frequently or make it > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging
[ https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652691#comment-16652691 ] Apache Spark commented on SPARK-24787: -- User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/22752 > Events being dropped at an alarming rate due to hsync being slow for > eventLogging > - > > Key: SPARK-24787 > URL: https://issues.apache.org/jira/browse/SPARK-24787 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 2.3.0, 2.3.1 >Reporter: Sanket Reddy >Priority: Minor > > [https://github.com/apache/spark/pull/16924/files] updates the length of the > inprogress files allowing history server being responsive. > Although we have a production job that has 6 tasks per stage and due to > hsync being slow it starts dropping events and the history server has wrong > stats due to events being dropped. > A viable solution is not to make it sync very frequently or make it > configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25733) The method toLocalIterator() with dataframe doesn't work
[ https://issues.apache.org/jira/browse/SPARK-25733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652746#comment-16652746 ] Bryan Cutler commented on SPARK-25733: -- Is this a duplicate of SPARK-23961? > The method toLocalIterator() with dataframe doesn't work > > > Key: SPARK-25733 > URL: https://issues.apache.org/jira/browse/SPARK-25733 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: Spark in standalone mode, and 48 cores are available. > spark-defaults.conf as blew: > spark.pyshark.python /usr/bin/python3.6 > spark.driver.memory 4g > spark.executor.memory 8g > > other configurations are at default. >Reporter: Bihui Jin >Priority: Major > Attachments: report_dataset.zip.001, report_dataset.zip.002 > > > {color:#FF}The dataset which I used attached.{color} > > First I loaded a dataframe from local disk: > df = spark.read.load('report_dataset') > there are about 200 partitions stored in s3, and the max size of partitions > is 28.37MB. > > after data loaded, I execute "df.take(1)" to test the dataframe, and > expected output printed > "[Row(s3_link='https://dcm-ul-phy.s3-china-1.eecloud.nsn-net.net/normal/run2/pool1/Tests.NbIot.NBCellSetupDelete.LTE3374_CellSetup_4x5M_2RX_3CELevel_Loop100.html', > sequences=[364, 15, 184, 34, 524, 49, 30, 527, 44, 366, 125, 85, 69, 524, > 49, 389, 575, 29, 179, 447, 168, 3, 223, 116, 573, 524, 49, 30, 527, 56, 366, > 125, 85, 524, 118, 295, 440, 123, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 29, 140, 268, 96, 508, 389, 32, 575, 529, 192, 524, 49, 389, 575, > 29, 179, 180, 451, 69, 286, 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, > 524, 49, 389, 575, 29, 42, 553, 451, 37, 125, 524, 49, 389, 575, 29, 42, 553, > 451, 368, 125, 88, 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, > 588, 524, 49, 389, 575, 29, 42, 553, 451, 368, 125, 88, 588, 524, 49, 389], > next_word=575, line_num=12)]" > > Then I try to convert dataframe to the local iterator and want to print one > row in dataframe for testing, and blew code is used: > for row in df.toLocalIterator(): > print(row) > break > {color:#ff}*But there is no output printed after that code > executed.*{color} > > Then I execute "df.take(1)" and blew error is reported: > ERROR:root:Exception while sending command. > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > ERROR:root:Exception while sending command. 
> Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1159, in send_command > raise Py4JNetworkError("Answer from Java side is empty") > py4j.protocol.Py4JNetworkError: Answer from Java side is empty > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 985, in send_command > response = connection.send_command(command) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1164, in send_command > "Error while receiving", e, proto.ERROR_ON_RECEIVE) > py4j.protocol.Py4JNetworkError: Error while receiving > ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java > server (127.0.0.1:37735) > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 2963, in run_code > exec(code_obj, self.user_global_ns, self.user_ns) > File "", line 1, in > df.take(1) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 504, in take > return self.limit(num).collect() > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/dataframe.py", line > 493, in limit > jdf = self._jdf.limit(num) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/java_gateway.py", line > 1257, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File "/opt/k2-v02/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, > in deco > return f(*a, **kw) > File "/opt/k2-v02/lib/python3.6/site-packages/py4j/protocol.py", line 336, in > get_return_value > format(target_id, ".", name)) > py4j.protocol.Py4JError: An error occurred while calling o29.limit > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File > "/opt/k2-v02/lib/python3.6/site-packages/IPython/core/interactiveshell.py", > line 1863, in showtraceback > stb = v
[jira] [Created] (SPARK-25753) binaryFiles broken for small files
liuxian created SPARK-25753: --- Summary: binaryFiles broken for small files Key: SPARK-25753 URL: https://issues.apache.org/jira/browse/SPARK-25753 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 3.0.0 Reporter: liuxian
{{StreamFileInputFormat}} and {{WholeTextFileInputFormat}} (https://issues.apache.org/jira/browse/SPARK-24610) have the same problem: for small files, the maxSplitSize computed by {{StreamFileInputFormat}} is far smaller than the default or commonly used split size of 64/128M, and Spark throws an exception while trying to read them.
Exception info:
{code}
Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:201)
  at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)
{code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
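For reference, the failure is reproducible with just a directory of tiny files once a per-node minimum split size is configured, as in the environment behind the stack trace above; the input path below is made up and the 5123456 value simply mirrors the reported log.
{code}
import org.apache.spark.sql.SparkSession

object SmallBinaryFilesRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("binaryFiles-small-files").getOrCreate()
    val sc = spark.sparkContext

    // Mirror the environment from the reported exception (~5 MB minimum split size per node).
    sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize.per.node", 5123456L)

    // A directory containing only small files (a few KB each); hypothetical path.
    val input = "file:///tmp/small-binary-files"

    // StreamFileInputFormat derives maxSplitSize from totalLen / minPartitions, so with tiny
    // files it can end up below the configured per-node minimum, and
    // CombineFileInputFormat.getSplits fails with
    // "Minimum split size pernode ... cannot be larger than maximum split size ...".
    val rdd = sc.binaryFiles(input, minPartitions = 2)
    println(rdd.count())

    spark.stop()
  }
}
{code}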