[jira] [Updated] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
[ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-21031: - Description: Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. Therefore, from `CatalogStatistics` alone we cannot tell whether its stats come from hive or from spark. As a result, hive's stats can be unexpectedly propagated into spark's stats. For example, for a catalog table we read stats from hive, e.g. "totalSize", and put them into `CatalogStatistics`. Then, when an "ALTER TABLE" command runs, we store the stats in `CatalogStatistics` back into the metastore as spark's stats (because we don't know whether they came from spark or not). But spark's stats should only be generated by the "ANALYZE" command, so persisting them as a side effect of "ALTER TABLE" is unexpected. Secondly, now that spark's stats exist in the metastore, after inserting new data we still cannot get the right `sizeInBytes` in `CatalogStatistics` even though hive updated "totalSize" in the metastore, because we prefer spark's stats (which should not exist) over hive's stats. {code} spark-sql> create table xx(i string, j string); spark-sql> insert into table xx select 'a', 'b'; spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Database default Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 Type MANAGED Provider hive Properties [serialization.format=1] Statistics 4 bytes Location file:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.089 seconds, Fetched 19 row(s) spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); Time taken: 0.187 seconds spark-sql> insert into table xx select 'c', 'd'; Time taken: 0.583 seconds spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Database default Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 Type MANAGED Provider hive Properties [serialization.format=1] Statistics 4 bytes (-- This should be 8 bytes) Location file:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.077 seconds, Fetched 19 row(s) {code} was: Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. Therefore, in `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. As a result, hive's stats can be unexpectedly propagated into spark' stats. For example, for a catalog table, we read stats from hive, e.g. "totalSize" and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we will store the stats in `CatalogStatistics` into metastore as spark's stats (because we don't know whether it's from spark or not). But spark's stats should be only generated by "ANALYZE" command. This is unexpected from this command. 
Secondly, now that we store wrong spark's stats, after inserting new data, although hive updated "totalSize" in metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong stats) over hive's stats. {code} spark-sql> create table xx(i string, j string); spark-sql> insert into table xx select 'a', 'b'; spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Databasedefault Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 TypeMANAGED Providerhive Properties [serialization.format=1] Statistics 4 bytes Locationfile:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.089 seconds, Fetched 19 row(s) spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); Time taken: 0.187 sec
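For context, spark's own table-level stats are meant to be produced only by the ANALYZE command, not as a side effect of other DDL such as ALTER TABLE. A minimal sketch of the intended flow, assuming a Hive-enabled SparkSession named {{spark}} and the table {{xx}} from the transcript above:
{code}
// Sketch only: spark's stats should come from ANALYZE, not from unrelated DDL.
// Assumes a Hive-enabled SparkSession `spark` and the table `xx` created above.
spark.sql("ANALYZE TABLE xx COMPUTE STATISTICS")                   // persists spark's table-level stats (sizeInBytes, rowCount)
spark.sql("ANALYZE TABLE xx COMPUTE STATISTICS FOR COLUMNS i, j")  // column-level stats used by the cost-based optimizer
spark.sql("DESC FORMATTED xx").show(100, false)                    // the Statistics row should now reflect spark's own stats

// By contrast, a property-only change should not write any stats back:
spark.sql("ALTER TABLE xx SET TBLPROPERTIES ('prop1' = 'yy')")
{code}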
[jira] [Assigned] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
[ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21031: Assignee: (was: Apache Spark) > Clearly separate hive stats and spark stats in catalog > -- > > Key: SPARK-21031 > URL: https://issues.apache.org/jira/browse/SPARK-21031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Currently, hive's stats are read into `CatalogStatistics`, while spark's > stats are also persisted through `CatalogStatistics`. Therefore, in > `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. > As a result, hive's stats can be unexpectedly propagated into spark' stats. > For example, for a catalog table, we read stats from hive, e.g. "totalSize" > and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we > will store the stats in `CatalogStatistics` into metastore as spark's stats > (because we don't know whether it's from spark or not). But spark's stats > should be only generated by "ANALYZE" command. This is unexpected from this > command. > Secondly, now that we store wrong spark's stats, after inserting new data, > although hive updated "totalSize" in metastore, we still cannot get the right > `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong > stats) over hive's stats. > {code} > spark-sql> create table xx(i string, j string); > spark-sql> insert into table xx select 'a', 'b'; > spark-sql> desc formatted xx; > # col_namedata_type comment > i string NULL > j string NULL > # Detailed Table Information > Database default > Table xx > Owner wzh > Created Thu Jun 08 18:30:46 PDT 2017 > Last Access Wed Dec 31 16:00:00 PST 1969 > Type MANAGED > Provider hive > Properties[serialization.format=1] > Statistics4 bytes > Location file:/Users/wzh/Projects/spark/spark-warehouse/xx > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Partition ProviderCatalog > Time taken: 0.089 seconds, Fetched 19 row(s) > spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); > Time taken: 0.187 seconds > spark-sql> insert into table xx select 'c', 'd'; > Time taken: 0.583 seconds > spark-sql> desc formatted xx; > # col_namedata_type comment > i string NULL > j string NULL > # Detailed Table Information > Database default > Table xx > Owner wzh > Created Thu Jun 08 18:30:46 PDT 2017 > Last Access Wed Dec 31 16:00:00 PST 1969 > Type MANAGED > Provider hive > Properties[serialization.format=1] > Statistics4 bytes (-- This should be 8 bytes) > Location file:/Users/wzh/Projects/spark/spark-warehouse/xx > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Partition ProviderCatalog > Time taken: 0.077 seconds, Fetched 19 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
[ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044009#comment-16044009 ] Apache Spark commented on SPARK-21031: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/18248 > Clearly separate hive stats and spark stats in catalog > -- > > Key: SPARK-21031 > URL: https://issues.apache.org/jira/browse/SPARK-21031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang > > Currently, hive's stats are read into `CatalogStatistics`, while spark's > stats are also persisted through `CatalogStatistics`. Therefore, in > `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. > As a result, hive's stats can be unexpectedly propagated into spark' stats. > For example, for a catalog table, we read stats from hive, e.g. "totalSize" > and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we > will store the stats in `CatalogStatistics` into metastore as spark's stats > (because we don't know whether it's from spark or not). But spark's stats > should be only generated by "ANALYZE" command. This is unexpected from this > command. > Secondly, now that we store wrong spark's stats, after inserting new data, > although hive updated "totalSize" in metastore, we still cannot get the right > `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong > stats) over hive's stats. > {code} > spark-sql> create table xx(i string, j string); > spark-sql> insert into table xx select 'a', 'b'; > spark-sql> desc formatted xx; > # col_namedata_type comment > i string NULL > j string NULL > # Detailed Table Information > Database default > Table xx > Owner wzh > Created Thu Jun 08 18:30:46 PDT 2017 > Last Access Wed Dec 31 16:00:00 PST 1969 > Type MANAGED > Provider hive > Properties[serialization.format=1] > Statistics4 bytes > Location file:/Users/wzh/Projects/spark/spark-warehouse/xx > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Partition ProviderCatalog > Time taken: 0.089 seconds, Fetched 19 row(s) > spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); > Time taken: 0.187 seconds > spark-sql> insert into table xx select 'c', 'd'; > Time taken: 0.583 seconds > spark-sql> desc formatted xx; > # col_namedata_type comment > i string NULL > j string NULL > # Detailed Table Information > Database default > Table xx > Owner wzh > Created Thu Jun 08 18:30:46 PDT 2017 > Last Access Wed Dec 31 16:00:00 PST 1969 > Type MANAGED > Provider hive > Properties[serialization.format=1] > Statistics4 bytes (-- This should be 8 bytes) > Location file:/Users/wzh/Projects/spark/spark-warehouse/xx > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Partition ProviderCatalog > Time taken: 0.077 seconds, Fetched 19 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
[ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21031: Assignee: Apache Spark > Clearly separate hive stats and spark stats in catalog > -- > > Key: SPARK-21031 > URL: https://issues.apache.org/jira/browse/SPARK-21031 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark > > Currently, hive's stats are read into `CatalogStatistics`, while spark's > stats are also persisted through `CatalogStatistics`. Therefore, in > `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. > As a result, hive's stats can be unexpectedly propagated into spark' stats. > For example, for a catalog table, we read stats from hive, e.g. "totalSize" > and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we > will store the stats in `CatalogStatistics` into metastore as spark's stats > (because we don't know whether it's from spark or not). But spark's stats > should be only generated by "ANALYZE" command. This is unexpected from this > command. > Secondly, now that we store wrong spark's stats, after inserting new data, > although hive updated "totalSize" in metastore, we still cannot get the right > `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong > stats) over hive's stats. > {code} > spark-sql> create table xx(i string, j string); > spark-sql> insert into table xx select 'a', 'b'; > spark-sql> desc formatted xx; > # col_namedata_type comment > i string NULL > j string NULL > # Detailed Table Information > Database default > Table xx > Owner wzh > Created Thu Jun 08 18:30:46 PDT 2017 > Last Access Wed Dec 31 16:00:00 PST 1969 > Type MANAGED > Provider hive > Properties[serialization.format=1] > Statistics4 bytes > Location file:/Users/wzh/Projects/spark/spark-warehouse/xx > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Partition ProviderCatalog > Time taken: 0.089 seconds, Fetched 19 row(s) > spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); > Time taken: 0.187 seconds > spark-sql> insert into table xx select 'c', 'd'; > Time taken: 0.583 seconds > spark-sql> desc formatted xx; > # col_namedata_type comment > i string NULL > j string NULL > # Detailed Table Information > Database default > Table xx > Owner wzh > Created Thu Jun 08 18:30:46 PDT 2017 > Last Access Wed Dec 31 16:00:00 PST 1969 > Type MANAGED > Provider hive > Properties[serialization.format=1] > Statistics4 bytes (-- This should be 8 bytes) > Location file:/Users/wzh/Projects/spark/spark-warehouse/xx > Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > InputFormat org.apache.hadoop.mapred.TextInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > Partition ProviderCatalog > Time taken: 0.077 seconds, Fetched 19 row(s) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
[ https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-21031: - Description: Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. Therefore, in `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. As a result, hive's stats can be unexpectedly propagated into spark' stats. For example, for a catalog table, we read stats from hive, e.g. "totalSize" and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we will store the stats in `CatalogStatistics` into metastore as spark's stats (because we don't know whether it's from spark or not). But spark's stats should be only generated by "ANALYZE" command. This is unexpected from this command. Secondly, now that we store wrong spark's stats, after inserting new data, although hive updated "totalSize" in metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong stats) over hive's stats. {code} spark-sql> create table xx(i string, j string); spark-sql> insert into table xx select 'a', 'b'; spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Databasedefault Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 TypeMANAGED Providerhive Properties [serialization.format=1] Statistics 4 bytes Locationfile:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.089 seconds, Fetched 19 row(s) spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); Time taken: 0.187 seconds spark-sql> insert into table xx select 'c', 'd'; Time taken: 0.583 seconds spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Databasedefault Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 TypeMANAGED Providerhive Properties [serialization.format=1] Statistics 4 bytes (-- This should be 8 bytes) Locationfile:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.077 seconds, Fetched 19 row(s) {code} was: Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. Therefore, in `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. As a result, hive's stats can be unexpectedly propagated into spark' stats. For example, by using "ALTER TABLE" command, we will store the stats info (read from hive, e.g. "totalSize") in `CatalogStatistics` into metastore as spark's stats (because we don't know whether it's from spark or not). But spark's stats should be only generated by "ANALYZE" command. This is unexpected from this command. Besides, now that we store wrong spark's stats, after inserting new data, although hive updated "totalSize" in metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect the wrong spark stats over hive's stats. 
{code} spark-sql> create table xx(i string, j string); spark-sql> insert into table xx select 'a', 'b'; spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Databasedefault Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 TypeMANAGED Providerhive Properties [serialization.format=1] Statistics 4 bytes Locationfile:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.089 seconds, Fetched 19 row(s) spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); Time taken: 0.187 seconds spark-sql> insert into table xx select 'c', 'd'; Time taken: 0.583 seconds sp
[jira] [Created] (SPARK-21031) Clearly separate hive stats and spark stats in catalog
Zhenhua Wang created SPARK-21031: Summary: Clearly separate hive stats and spark stats in catalog Key: SPARK-21031 URL: https://issues.apache.org/jira/browse/SPARK-21031 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang Currently, hive's stats are read into `CatalogStatistics`, while spark's stats are also persisted through `CatalogStatistics`. Therefore, in `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. As a result, hive's stats can be unexpectedly propagated into spark' stats. For example, by using "ALTER TABLE" command, we will store the stats info (read from hive, e.g. "totalSize") in `CatalogStatistics` into metastore as spark's stats (because we don't know whether it's from spark or not). But spark's stats should be only generated by "ANALYZE" command. This is unexpected from this command. Besides, now that we store wrong spark's stats, after inserting new data, although hive updated "totalSize" in metastore, we still cannot get the right `sizeInBytes` in `CatalogStatistics`, because we respect the wrong spark stats over hive's stats. {code} spark-sql> create table xx(i string, j string); spark-sql> insert into table xx select 'a', 'b'; spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Databasedefault Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 TypeMANAGED Providerhive Properties [serialization.format=1] Statistics 4 bytes Locationfile:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.089 seconds, Fetched 19 row(s) spark-sql> alter table xx set tblproperties ('prop1' = 'yy'); Time taken: 0.187 seconds spark-sql> insert into table xx select 'c', 'd'; Time taken: 0.583 seconds spark-sql> desc formatted xx; # col_name data_type comment i string NULL j string NULL # Detailed Table Information Databasedefault Table xx Owner wzh Created Thu Jun 08 18:30:46 PDT 2017 Last Access Wed Dec 31 16:00:00 PST 1969 TypeMANAGED Providerhive Properties [serialization.format=1] Statistics 4 bytes (-- This should be 8 bytes) Locationfile:/Users/wzh/Projects/spark/spark-warehouse/xx Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.TextInputFormat OutputFormatorg.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Partition Provider Catalog Time taken: 0.077 seconds, Fetched 19 row(s) {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043973#comment-16043973 ] Dongjoon Hyun commented on SPARK-20954: --- ^^ > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hive
[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043972#comment-16043972 ] Wenchen Fan commented on SPARK-20954: - oh sorry I misclicked... > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hive
[jira] [Assigned] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20954: --- Assignee: Dongjoon Hyun (was: Liang-Chi Hsieh) > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan >Assignee: Dongjoon Hyun > Fix For: 2.2.0 > > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hive
[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043948#comment-16043948 ] Dongjoon Hyun commented on SPARK-20954: --- Hi, Wenchen. I'm Dongjoon. :) > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hive
[jira] [Created] (SPARK-21030) extend hint syntax to support any expression for Python and R
Felix Cheung created SPARK-21030: Summary: extend hint syntax to support any expression for Python and R Key: SPARK-21030 URL: https://issues.apache.org/jira/browse/SPARK-21030 Project: Spark Issue Type: Improvement Components: PySpark, SparkR, SQL Affects Versions: 2.2.0 Reporter: Felix Cheung See SPARK-20854: we need to relax the parameter checks in https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422 and in https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
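For reference, the Scala/SQL side already accepts arbitrary hint parameters after SPARK-20854; this ticket asks for parity in the Python and R wrappers. A hedged sketch of the Scala API shape those wrappers would mirror (the second hint name and its values are purely illustrative, not real hints):
{code}
// Illustrates the Dataset.hint(name, parameters*) shape only; "someHint" below is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("hint-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "v")
val other = Seq((1, "x")).toDF("id", "w")

// A plain hint and a parameterized hint; Python/R would need to accept the same parameter forms.
val joined = df.hint("broadcast").join(other, "id")
val hinted = df.hint("someHint", 10, "extra")   // hypothetical name, shown only for the parameter syntax
joined.explain()
{code}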
[jira] [Created] (SPARK-21029) All StreamingQuery should be stopped when the SparkSession is stopped
Felix Cheung created SPARK-21029: Summary: All StreamingQuery should be stopped when the SparkSession is stopped Key: SPARK-21029 URL: https://issues.apache.org/jira/browse/SPARK-21029 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0, 2.3.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
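Until this is addressed, the usual manual workaround is to stop any active queries before stopping the session. A hedged sketch of that workaround (not the proposed fix):
{code}
// Sketch of a manual workaround only: stop all active streaming queries
// explicitly before stopping the SparkSession.
import org.apache.spark.sql.SparkSession

def stopAllQueriesThenSession(spark: SparkSession): Unit = {
  spark.streams.active.foreach { q =>
    q.stop()        // stops the query and its underlying execution
  }
  spark.stop()      // now the session can shut down cleanly
}
{code}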
[jira] [Commented] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs
[ https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043895#comment-16043895 ] Felix Cheung commented on SPARK-20510: -- credit to SPARK-20208 SPARK-20849 SPARK-20477 SPARK-20478 SPARK-20258 SPARK-20026 SPARK-20015 > SparkR 2.2 QA: Update user guide for new features & APIs > > > Key: SPARK-20510 > URL: https://issues.apache.org/jira/browse/SPARK-20510 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Check the user guide vs. a list of new APIs (classes, methods, data members) > to see what items require updates to the user guide. > For each feature missing user guide doc: > * Create a JIRA for that feature, and assign it to the author of the feature > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). > If you would like to work on this task, please comment, and we can create & > link JIRAs for parts of this work. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code
[ https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043894#comment-16043894 ] Felix Cheung commented on SPARK-20511: -- credit to SPARK-20208 SPARK-20849 SPARK-20477 SPARK-20478 SPARK-20258 SPARK-20026 SPARK-20015 > SparkR 2.2 QA: Check for new R APIs requiring example code > -- > > Key: SPARK-20511 > URL: https://issues.apache.org/jira/browse/SPARK-20511 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley > > Audit list of new features added to MLlib's R API, and see which major items > are missing example code (in the examples folder). We do not need examples > for everything, only for major items such as new algorithms. > For any such items: > * Create a JIRA for that feature, and assign it to the author of the feature > (or yourself if interested). > * Link it to (a) the original JIRA which introduced that feature ("related > to") and (b) to this JIRA ("requires"). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20513) Update SparkR website for 2.2
[ https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043892#comment-16043892 ] Felix Cheung commented on SPARK-20513: -- right, I don't think there's a site for R https://github.com/apache/spark-website > Update SparkR website for 2.2 > - > > Key: SPARK-20513 > URL: https://issues.apache.org/jira/browse/SPARK-20513 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Reporter: Joseph K. Bradley >Priority: Critical > > Update the sub-project's website to include new features in this release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-20954: --- Assignee: Liang-Chi Hsieh > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan >Assignee: Liang-Chi Hsieh > Fix For: 2.2.0 > > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hive >
[jira] [Resolved] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20954. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 18245 [https://github.com/apache/spark/pull/18245] > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan > Fix For: 2.2.0 > > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hive
[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage
[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043877#comment-16043877 ] Fei Shao commented on SPARK-20589: -- Tasks are assigned to executors. If we set the number of executors to 5 but limit the number of simultaneous tasks to 2, the two settings contradict each other. So can we change the requirement to "allow limiting task concurrency per executor", please? > Allow limiting task concurrency per stage > - > > Key: SPARK-20589 > URL: https://issues.apache.org/jira/browse/SPARK-20589 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > It would be nice to have the ability to limit the number of concurrent tasks > per stage. This is useful when your spark job might be accessing another > service and you don't want to DOS that service. For instance Spark writing > to hbase or Spark doing http puts on a service. Many times you want to do > this without limiting the number of partitions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
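For reference, the closest existing knobs are application-wide (executor count times cores per executor, e.g. 5 executors x 2 cores = at most 10 running tasks) or the partition count itself, which is exactly the workaround the JIRA wants to avoid. A hedged sketch of that partition-count workaround:
{code}
// Sketch of today's workaround only; this is not the per-stage limit requested here.
// spark.executor.instances / spark.executor.cores cap concurrency for the whole app
// and are typically set via spark-submit on a real cluster.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[8]")                 // stand-in for a real cluster master
  .appName("concurrency-cap-sketch")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 200)

// Shrink the partition count so the sensitive stage can never run more than 4 tasks at once.
rdd.coalesce(4).foreachPartition { it =>
  it.foreach(_ => ())   // e.g. HTTP PUTs or HBase writes against the fragile service
}
{code}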
[jira] [Commented] (SPARK-21001) Staging folders from Hive table are not being cleared.
[ https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043869#comment-16043869 ] Liang-Chi Hsieh commented on SPARK-21001: - No, I mean the current 2.0 branch in git. I think there's no 2.0.3 release yet. > Staging folders from Hive table are not being cleared. > -- > > Key: SPARK-21001 > URL: https://issues.apache.org/jira/browse/SPARK-21001 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Ajay Cherukuri > > Staging folders that were being created as a part of Data loading to Hive > table by using spark job, are not cleared. > Staging folder are remaining in Hive External table folders even after Spark > job is completed. > This is the same issue mentioned in the > ticket:https://issues.apache.org/jira/browse/SPARK-18372 > This ticket says the issues was resolved in 1.6.4. But, now i found that it's > still existing on 2.0.2. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714 ] Robert Kruszewski edited comment on SPARK-20952 at 6/9/17 2:12 AM: --- ~~Right, but how do I pass it downstream?~~ So I would store it and restore it inside the threadpool? Now every spark contributor has to know about it but if that's preferred happy to modify. was (Author: robert3005): ~Right, but how do I pass it downstream?~ So I would store it and restore it inside the threadpool? Now every spark contributor has to know about it but if that's preferred happy to modify. > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714 ] Robert Kruszewski edited comment on SPARK-20952 at 6/9/17 2:13 AM: --- -Right, but how do I pass it downstream?- So I would store it and restore it inside the threadpool? Now every spark contributor has to know about it but if that's preferred happy to modify. was (Author: robert3005): ~~Right, but how do I pass it downstream?~~ So I would store it and restore it inside the threadpool? Now every spark contributor has to know about it but if that's preferred happy to modify. > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
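A hedged sketch of the "store it and pass it" pattern being discussed: capture the context on the task thread and close over it, instead of calling TaskContext.get() from a pool thread (where it is currently null). The function name and its {{paths}} parameter are illustrative, and the code assumes it runs inside a task (e.g. from mapPartitions):
{code}
// Sketch of the workaround under discussion, assuming this code runs inside a task.
// TaskContext.get() returns null on a forked thread today, so the value is captured
// on the task thread and referenced from the closures submitted to the pool.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.TaskContext

def readInParallel(paths: Seq[String]): Seq[Int] = {
  val tc = TaskContext.get()   // captured on the task thread, where the ThreadLocal is set
  val pool = Executors.newFixedThreadPool(4)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  try {
    val futures = paths.map { p =>
      Future {
        // use the captured `tc` here instead of TaskContext.get()
        tc.partitionId()
      }
    }
    Await.result(Future.sequence(futures), 1.minute)
  } finally {
    pool.shutdown()
  }
}
{code}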
[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043778#comment-16043778 ] Wenchen Fan commented on SPARK-18075: - for development/testing, you can special the spark master as {{local-cluster[4, 8, 2048]}}, which simulates a 4 nodes spark cluster with 8 cores and 2g ram each node > UDF doesn't work on non-local spark > --- > > Key: SPARK-18075 > URL: https://issues.apache.org/jira/browse/SPARK-18075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Nick Orka > > I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) > According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 > I've made all spark dependancies with PROVIDED scope. I use 100% same > versions of spark in the app as well as for spark server. > Here is my pom: > {code:title=pom.xml} > > 1.6 > 1.6 > UTF-8 > 2.11.8 > 2.0.0 > 2.7.0 > > > > > org.apache.spark > spark-core_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-sql_2.11 > ${spark.version} > provided > > > org.apache.spark > spark-hive_2.11 > ${spark.version} > provided > > > {code} > As you can see all spark dependencies have provided scope > And this is a code for reproduction: > {code:title=udfTest.scala} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > import org.apache.spark.sql.{Row, SparkSession} > /** > * Created by nborunov on 10/19/16. > */ > object udfTest { > class Seq extends Serializable { > var i = 0 > def getVal: Int = { > i = i + 1 > i > } > } > def main(args: Array[String]) { > val spark = SparkSession > .builder() > .master("spark://nborunov-mbp.local:7077") > // .master("local") > .getOrCreate() > val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two"))) > val schema = StructType(Array(StructField("name", StringType))) > val df = spark.createDataFrame(rdd, schema) > df.show() > spark.udf.register("func", (name: String) => name.toUpperCase) > import org.apache.spark.sql.functions.expr > val newDf = df.withColumn("upperName", expr("func(name)")) > newDf.show() > val seq = new Seq > spark.udf.register("seq", () => seq.getVal) > val seqDf = df.withColumn("id", expr("seq()")) > seqDf.show() > df.createOrReplaceTempView("df") > spark.sql("select *, seq() as sql_id from df").show() > } > } > {code} > When .master("local") - everything works fine. When > .master("spark://...:7077"), it fails on line: > {code} > newDf.show() > {code} > The error is exactly the same: > {code} > scala> udfTest.main(Array()) > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 > 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov > 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: > 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(nborunov); > groups with view permissions: Set(); users with modify permissions: > Set(nborunov); groups with modify permissions: Set() > 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on > port 57828. > 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker > 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster > 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at > /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 > 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 > MB > 16/10/19 19
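For reference, a minimal sketch of the local-cluster suggestion above; local-cluster mode launches real worker JVMs, so it generally needs SPARK_HOME pointing at a Spark distribution and the application classes visible to those workers (the app name and SQL here are illustrative):
{code}
import org.apache.spark.sql.SparkSession

// 4 workers, 8 cores each, 2048 MB per worker; much closer to a real cluster than local[*]
val spark = SparkSession.builder()
  .appName("udf-smoke-test")
  .master("local-cluster[4, 8, 2048]")
  .getOrCreate()

spark.udf.register("func", (name: String) => name.toUpperCase)
spark.sql("SELECT func('abc')").show()   // exercises UDF serialization to separate executor JVMs

spark.stop()
{code}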
[jira] [Resolved] (SPARK-20863) Add metrics/instrumentation to LiveListenerBus
[ https://issues.apache.org/jira/browse/SPARK-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20863. - Resolution: Fixed Fix Version/s: 2.3.0 > Add metrics/instrumentation to LiveListenerBus > -- > > Key: SPARK-20863 > URL: https://issues.apache.org/jira/browse/SPARK-20863 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.3.0 > > > I think that we should add Coda Hale metrics to the LiveListenerBus in order > to count the number of queued, processed, and dropped events, as well as a > timer tracking per-event processing times. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1
[ https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13933: Assignee: (was: Apache Spark) > hadoop-2.7 profile's curator version should be 2.7.1 > > > Key: SPARK-13933 > URL: https://issues.apache.org/jira/browse/SPARK-13933 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Steve Loughran >Priority: Minor > > This is pretty minor, more due diligence than any binary compatibility. > # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0 > # the actual hadoop-2.7.1 dependency is of curator 2.7.1; this came from > HADOOP-11492 > For consistency, the profile can/should be changed. However, note that as > well as some incompatibilities defined in HADOOP-11492; the version of Guava > that curator asserts a need for is 15.x. HADOOP-11612 showed what needed to > be done to address compatibility problems there; one of the Curator classes > had to be forked to make compatible with guava 11+ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1
[ https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043748#comment-16043748 ] Apache Spark commented on SPARK-13933: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18247 > hadoop-2.7 profile's curator version should be 2.7.1 > > > Key: SPARK-13933 > URL: https://issues.apache.org/jira/browse/SPARK-13933 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Steve Loughran >Priority: Minor > > This is pretty minor, more due diligence than any binary compatibility. > # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0 > # the actual hadoop-2.7.1 dependency is of curator 2.7.1; this came from > HADOOP-11492 > For consistency, the profile can/should be changed. However, note that as > well as some incompatibilities defined in HADOOP-11492; the version of Guava > that curator asserts a need for is 15.x. HADOOP-11612 showed what needed to > be done to address compatibility problems there; one of the Curator classes > had to be forked to make compatible with guava 11+ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1
[ https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13933: Assignee: Apache Spark > hadoop-2.7 profile's curator version should be 2.7.1 > > > Key: SPARK-13933 > URL: https://issues.apache.org/jira/browse/SPARK-13933 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.0.0 >Reporter: Steve Loughran >Assignee: Apache Spark >Priority: Minor > > This is pretty minor, more due diligence than any binary compatibility. > # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0 > # the actual hadoop-2.7.1 dependency is of curator 2.7.1; this came from > HADOOP-11492 > For consistency, the profile can/should be changed. However, note that as > well as some incompatibilities defined in HADOOP-11492; the version of Guava > that curator asserts a need for is 15.x. HADOOP-11612 showed what needed to > be done to address compatibility problems there; one of the Curator classes > had to be forked to make compatible with guava 11+ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
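For a downstream build that wants to line up with the Curator version Hadoop 2.7.1 actually ships, a hedged sbt sketch (standard Curator coordinates; whether a plain version override is sufficient depends on the Guava incompatibilities noted above):
{code}
// build.sbt (sbt 1.x) -- force Curator to the version used by Hadoop 2.7.1
dependencyOverrides ++= Seq(
  "org.apache.curator" % "curator-client"    % "2.7.1",
  "org.apache.curator" % "curator-framework" % "2.7.1",
  "org.apache.curator" % "curator-recipes"   % "2.7.1"
)
{code}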
[jira] [Created] (SPARK-21028) Parallel One vs. Rest Classifier Scala
Ajay Saini created SPARK-21028: -- Summary: Parallel One vs. Rest Classifier Scala Key: SPARK-21028 URL: https://issues.apache.org/jira/browse/SPARK-21028 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0, 2.2.1 Reporter: Ajay Saini Adding a class for a parallel one vs. rest implementation to the ml package in Spark. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
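For context, the existing serial API that this would extend; a sketch against spark.ml's current OneVsRest, with a tiny made-up dataset. The proposed parallelism parameter is intentionally not shown, since it does not exist yet:
{code}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ovr-baseline").getOrCreate()
import spark.implicits._

// Tiny 3-class dataset; "label" and "features" are the column names OneVsRest expects by default.
val training = Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (2.0, Vectors.dense(2.0, -1.0))
).toDF("label", "features")

// Today each binary sub-model is trained sequentially; the ticket proposes training them in parallel.
val ovr   = new OneVsRest().setClassifier(new LogisticRegression().setMaxIter(10))
val model = ovr.fit(training)
model.transform(training).show()
{code}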
[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Saini updated SPARK-21027: --- Description: Adding a class called ParOneVsRest that includes support for a parallelism parameter in a one vs. rest implementation. A parallel one vs. rest implementation gives up to a 2X speedup when tested on a dataset with 181024 points. A ticket for the Scala implementation of this classifier is here: https://issues.apache.org/jira/browse/SPARK-21028 (was: Adding a class called ParOneVsRest that includes support for a parallelism parameter in a one vs. rest implementation. A parallel one vs. rest implementation gives up to a 2X speedup when tested on a dataset with 181024 points.) > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini > > Adding a class called ParOneVsRest that includes support for a parallelism > parameter in a one vs. rest implementation. A parallel one vs. rest > implementation gives up to a 2X speedup when tested on a dataset with 181024 > points. A ticket for the Scala implementation of this classifier is here: > https://issues.apache.org/jira/browse/SPARK-21028 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier
[ https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ajay Saini updated SPARK-21027: --- Description: Adding a class called ParOneVsRest that includes support for a parallelism parameter in a one vs. rest implementation. A parallel one vs. rest implementation gives up to a 2X speedup when tested on a dataset with 181024 points. (was: Adding a class called ParOneVsRest that includes support for a parallelism parameter in a one vs. rest implementation.) > Parallel One vs. Rest Classifier > > > Key: SPARK-21027 > URL: https://issues.apache.org/jira/browse/SPARK-21027 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 2.2.0, 2.2.1 >Reporter: Ajay Saini > > Adding a class called ParOneVsRest that includes support for a parallelism > parameter in a one vs. rest implementation. A parallel one vs. rest > implementation gives up to a 2X speedup when tested on a dataset with 181024 > points. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21027) Parallel One vs. Rest Classifier
Ajay Saini created SPARK-21027: -- Summary: Parallel One vs. Rest Classifier Key: SPARK-21027 URL: https://issues.apache.org/jira/browse/SPARK-21027 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 2.2.0, 2.2.1 Reporter: Ajay Saini Adding a class called ParOneVsRest that includes support for a parallelism parameter in a one vs. rest implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043732#comment-16043732 ] Shixiong Zhu commented on SPARK-20952: -- For `ParquetFileFormat#readFootersInParallel`, I would suggest that you just set the TaskContext in "parFiles.flatMap". {code} val taskContext = TaskContext.get val parFiles = partFiles.par parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8)) parFiles.flatMap { currentFile => TaskContext.setTaskContext(taskContext) ... }.seq {code} In this special case, it's safe since this is a local one-time thread pool. > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714 ] Robert Kruszewski edited comment on SPARK-20952 at 6/9/17 12:28 AM: ~Right, but how do I pass it downstream?~ So I would store it and restore it inside the threadpool? Now every spark contributor has to know about it but if that's preferred happy to modify. was (Author: robert3005): Right, but how do I pass it downstream? > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714 ] Robert Kruszewski commented on SPARK-20952: --- Right, but how do I pass it downstream? > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043712#comment-16043712 ] Robert Kruszewski commented on SPARK-20952: --- It doesn't but things underneath it do. It's weird from consumer perspective that you have a feature that you can't really use because you can't assert that it behaves consistently. In my case we have some filesystem features relying on taskcontext > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043710#comment-16043710 ] Shixiong Zhu commented on SPARK-20952: -- Although I don't know what you plan to do, you can save the TaskContext into a local variable like this: {code} private[parquet] def readParquetFootersInParallel( conf: Configuration, partFiles: Seq[FileStatus], ignoreCorruptFiles: Boolean): Seq[Footer] = { val taskContext = TaskContext.get val parFiles = partFiles.par parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8)) parFiles.flatMap { currentFile => try { // Use `taskContext` rather than `TaskContext.get` // Skips row group information since we only need the schema. // ParquetFileReader.readFooter throws RuntimeException, instead of IOException, // when it can't read the footer. Some(new Footer(currentFile.getPath(), ParquetFileReader.readFooter( conf, currentFile, SKIP_ROW_GROUPS))) } catch { case e: RuntimeException => if (ignoreCorruptFiles) { logWarning(s"Skipped the footer in the corrupted file: $currentFile", e) None } else { throw new IOException(s"Could not read footer for file: $currentFile", e) } } }.seq } {code} > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043708#comment-16043708 ] Shixiong Zhu commented on SPARK-20952: -- Why it needs TaskContext? > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043704#comment-16043704 ] Robert Kruszewski commented on SPARK-20952: --- No modifications, it's this code https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L477 which spins up a threadpool to read files per partition. I imagine there's more cases like this but first one I encountered > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"
[ https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043700#comment-16043700 ] Apache Spark commented on SPARK-20954: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/18245 > DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |" > - > > Key: SPARK-20954 > URL: https://issues.apache.org/jira/browse/SPARK-20954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Garros Chan > > I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added > to the result. You can see there is this 1 extra row with "| # col_name | > data_type | comment |" ; however, select and select count(*) only shows 1 > row. > I searched online a long time and do not find any useful information. > Is this a bug? > hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline > Beeline version 1.2.1.spark2 by Apache Hive > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: backward-delete-word > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > [INFO] Unable to bind key for unsupported operation: up-history > [INFO] Unable to bind key for unsupported operation: down-history > beeline> !connect jdbc:hive2://localhost:10016 > Connecting to jdbc:hive2://localhost:10016 > Enter username for jdbc:hive2://localhost:10016: hive > Enter password for jdbc:hive2://localhost:10016: > 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016 > 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016 > 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with > JDBC Uri: jdbc:hive2://localhost:10016 > Connected to: Spark SQL (version 2.2.1-SNAPSHOT) > Driver: Hive JDBC (version 1.2.1.spark2) > Transaction isolation: TRANSACTION_REPEATABLE_READ > 0: jdbc:hive2://localhost:10016> describe garros.hivefloat; > +-++--+--+ > | col_name | data_type | comment | > +-++--+--+ > | # col_name | data_type | comment | > | c1 | float | NULL | > +-++--+--+ > 2 rows selected (0.396 seconds) > 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat; > +-+--+ > | c1 | > +-+--+ > | 123.99800109863281 | > +-+--+ > 1 row selected (0.319 seconds) > 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat; > +---+--+ > | count(1) | > +---+--+ > | 1 | > +---+--+ > 1 row selected (0.783 seconds) > 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint; > +---+-+--+--+ > | col_name| data_type > | comment | > +---+-+--+--+ > | # col_name| data_type > | comment | > | c1| int > | NULL | > | | > | | > | # Detailed Table Information | > | | > | Database | garros > | | > | Table | hiveint > | | > | Owner | root > | | > | Created | Thu Feb 09 17:40:36 EST 2017 > | | > | Last Access | Wed Dec 31 19:00:00 EST 1969 > | | > | Type | MANAGED > | | > | Provider | hiv
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043694#comment-16043694 ] Shixiong Zhu commented on SPARK-20952: -- [~robert3005] could you show me your codes? Are you modifying "ParquetFileFormat#readFootersInParallel"? > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043669#comment-16043669 ] Robert Kruszewski commented on SPARK-20952: --- I am not really attached to the solution. Would be happy to implement anything that maintainers are happy with as long as it ensures we get taskcontext always anywhere on the task side. For instance issue I am facing now is that ParquetFileFormat#readFootersInParallel is not able to access it leading to failures. > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043655#comment-16043655 ] Shixiong Zhu commented on SPARK-20952: -- If TaskContext is not inheritable, we can always find a way to pass it to the codes that need to access it. But if it's inheritable, it's pretty hard to avoid TaskContext pollution (or avoid using a stale TaskContext, you have to always set it manually in a task running in a cached thread). [~joshrosen] listed many tickets that are caused by localProperties is InheritableThreadLocal: https://issues.apache.org/jira/browse/SPARK-14686?focusedCommentId=15244478&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15244478 > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
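The pollution problem is easy to reproduce with a plain InheritableThreadLocal and a reused pool thread; a minimal self-contained sketch (no Spark classes involved): the pooled thread copies the value once, when it is created, and keeps that value for every later task.
{code}
import java.util.concurrent.Executors

object InheritableLeak extends App {
  val ctx  = new InheritableThreadLocal[String]()
  val pool = Executors.newFixedThreadPool(1)   // a single, reused worker thread

  ctx.set("task-1")
  pool.submit(new Runnable {
    def run(): Unit = println(s"first submit sees:  ${ctx.get}")   // "task-1", inherited at thread creation
  }).get()

  ctx.set("task-2")
  pool.submit(new Runnable {
    def run(): Unit = println(s"second submit sees: ${ctx.get}")   // still "task-1": a stale, polluted value
  }).get()

  pool.shutdown()
}
{code}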
[jira] [Commented] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception
[ https://issues.apache.org/jira/browse/SPARK-20211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043639#comment-16043639 ] Apache Spark commented on SPARK-20211: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18244 > `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) > exception > - > > Key: SPARK-20211 > URL: https://issues.apache.org/jira/browse/SPARK-20211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1 >Reporter: StanZhai > Labels: correctness > > The following SQL: > {code} > select 1 > 0.0001 from tb > {code} > throws Decimal scale (0) cannot be greater than precision (-2) exception in > Spark 2.x. > `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and > Spark 2.x. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures
[ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043622#comment-16043622 ] Josh Rosen commented on SPARK-20178: Update: I commented over on https://github.com/apache/spark/pull/18150#discussion_r121018254. I now think that [~sitalke...@gmail.com]'s original approach is a good move for now. If there's controversy then I propose to add an experimental feature-flag to let users fall back to older behavior. > Improve Scheduler fetch failures > > > Key: SPARK-20178 > URL: https://issues.apache.org/jira/browse/SPARK-20178 > Project: Spark > Issue Type: Epic > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Thomas Graves > > We have been having a lot of discussions around improving the handling of > fetch failures. There are 4 jira currently related to this. > We should try to get a list of things we want to improve and come up with one > cohesive design. > SPARK-20163, SPARK-20091, SPARK-14649 , and SPARK-19753 > I will put my initial thoughts in a follow on comment. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal
[ https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043623#comment-16043623 ] Robert Kruszewski commented on SPARK-20952: --- This is already an issue though on driver side (that threadpool is driver side which already has inheritable thread pool). This issue is only so we have same behaviour on executors and driver > TaskContext should be an InheritableThreadLocal > --- > > Key: SPARK-20952 > URL: https://issues.apache.org/jira/browse/SPARK-20952 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Robert Kruszewski >Priority: Minor > > TaskContext is a ThreadLocal as a result when you fork a thread inside your > executor task you lose the handle on the original context set by the > executor. We should change it to InheritableThreadLocal so we can access it > inside thread pools on executors. > See ParquetFileFormat#readFootersInParallel for example of code that uses > thread pools inside the tasks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20953) Add hash map metrics to aggregate and join
[ https://issues.apache.org/jira/browse/SPARK-20953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043581#comment-16043581 ] Reynold Xin commented on SPARK-20953: - I'd show the avg in the UI if possible. As a matter of fact maybe only show the avg. > Add hash map metrics to aggregate and join > -- > > Key: SPARK-20953 > URL: https://issues.apache.org/jira/browse/SPARK-20953 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > It would be useful if we can identify hash map collision issues early on. > We should add avg hash map probe metric to aggregate operator and hash join > operator and report them. If the avg probe is greater than a specific > (configurable) threshold, we should log an error at runtime. > The primary classes to look at are UnsafeFixedWidthAggregationMap, > HashAggregateExec, HashedRelation, HashJoin. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
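A rough sketch of the bookkeeping involved, purely illustrative and not the actual UnsafeFixedWidthAggregationMap/HashedRelation code: count probes per lookup, expose the running average (the single number worth surfacing in the UI), and warn once it passes a configurable threshold.
{code}
// Illustrative probe accounting for an open-addressing hash map; not Spark's real classes.
class ProbeMetrics(warnThreshold: Double = 4.0) {
  private var lookups = 0L
  private var probes  = 0L

  def recordLookup(probesForThisKey: Int): Unit = {
    lookups += 1
    probes  += probesForThisKey
  }

  // The running average that would be reported in the UI.
  def avgProbes: Double = if (lookups == 0) 0.0 else probes.toDouble / lookups

  def maybeWarn(): Unit =
    if (avgProbes > warnThreshold)
      System.err.println(f"avg hash probe $avgProbes%.2f exceeds $warnThreshold%.1f; likely key collision problem")
}
{code}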
[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043550#comment-16043550 ] Nico Pappagianis commented on SPARK-10795: -- [~HackerWilson] Were you able to resolve this? I'm hitting the same thing running Spark 2.0.1 and Hadoop 2.7.2. My Python code is just creating a SparkContext and then calling sc.stop(). In the YARN logs I see: INFO: 2017-06-08 22:16:24,462 INFO [main] yarn.Client - Uploading resource file:/home/.../python/lib/py4j-0.10.1-src.zip -> hdfs://.../.sparkStaging/application_1494012577752_1403/py4j-0.10.1-src.zip when I do an fs -ls on the above HDFS directory it shows the py4j file, but the job fails with a FileNotFoundException for the py4j file above: File does not exist: hdfs://.../.sparkStaging/application_1494012577752_1403/py4j-0.10.1-src.zip (stack trace here: https://gist.github.com/anonymous/5506654b88e19e6f51ffbd85cd3f25ee) One thing to note is that I am launching a Map-only job that launches a the Spark application on the cluster. The launcher job is using SparkLauncher (Java). Master and deploy mode are set to "yarn" and "cluster", respectively. When I submit the Python job from via a spark-submit it runs successfully (I set the HADOOP_CONF_DIR and HADOOP_JAVA_HOME to the same as what I am setting using the launcher job). > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
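For comparison with the working spark-submit path, a sketch of the launcher-side setup described above; every path, queue name, and script name is a placeholder:
{code}
import org.apache.spark.launcher.SparkLauncher

val env = new java.util.HashMap[String, String]()
env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf")        // placeholder

val handle = new SparkLauncher(env)
  .setSparkHome("/opt/spark")                         // placeholder
  .setAppResource("/path/to/job.py")                  // the PySpark script
  .setMaster("yarn")
  .setDeployMode("cluster")
  .setConf("spark.yarn.queue", "default")             // placeholder
  .startApplication()
{code}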
[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043478#comment-16043478 ] Cheng Lian commented on SPARK-20958: [~marmbrus], here is the draft release note entry: {quote} SPARK-20958: For users who use parquet-avro together with Spark 2.2, please use parquet-avro 1.8.1 instead of parquet-avro 1.8.2. This is because parquet-avro 1.8.2 upgrades avro from 1.7.6 to 1.8.1, which is backward incompatible with 1.7.6. {quote} > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: release-notes, release_notes, releasenotes > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
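In build terms the advice above amounts to pinning the artifact; an sbt sketch (Maven users would pin the same coordinates):
{code}
// build.sbt -- keep parquet-avro on 1.8.1 so its avro 1.7.x dependency stays compatible with Spark 2.2
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"
{code}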
[jira] [Created] (SPARK-21026) Document jenkins plug-ins assumed by the spark documentation build
Erik Erlandson created SPARK-21026: -- Summary: Document jenkins plug-ins assumed by the spark documentation build Key: SPARK-21026 URL: https://issues.apache.org/jira/browse/SPARK-21026 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.1.1 Reporter: Erik Erlandson I haven't been able to find documentation on what plug-ins the spark doc build assumes for jenkins. Is there a list somewhere, or a gemfile? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21018) "Completed Jobs" and "Completed Stages" support pagination
[ https://issues.apache.org/jira/browse/SPARK-21018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth resolved SPARK-21018. -- Resolution: Duplicate This was added in Spark 2.1 by SPARK-15590 > "Completed Jobs" and "Completed Stages" support pagination > -- > > Key: SPARK-21018 > URL: https://issues.apache.org/jira/browse/SPARK-21018 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.2 >Reporter: Jinhua Fu >Priority: Minor > Attachments: CompletedJobs.png, PagedTasks.png > > > When using Thriftsever, the number of jobs and Stages may be very large, and > if not paginated, the page will be very long and slow to load, especially > when spark.ui.retainedJobs is set to a large value. So I suggest "completed > Jobs" and "completed Stages" support pagination. > I'd like to change them to a paging display similar to the tasks in the > "Details for Stage" page. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-20958: --- Labels: release-notes release_notes releasenotes (was: release-notes) > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: release-notes, release_notes, releasenotes > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meng xi updated SPARK-21025: Comment: was deleted (was: I attached the Java file) > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > Attachments: SparkTest.java > > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meng xi reopened SPARK-21025: - I attached the Java file > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > Attachments: SparkTest.java > > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043390#comment-16043390 ] meng xi commented on SPARK-21025: - I attached the JAVA file, which does not have the format issue > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > Attachments: SparkTest.java > > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meng xi updated SPARK-21025: Attachment: SparkTest.java > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > Attachments: SparkTest.java > > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meng xi updated SPARK-21025: Description: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. was: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. 
Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
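A minimal Scala sketch of the same pattern with a likely workaround, assuming the missing rows come from parallelize reading its input collection lazily (consistent with the report that an early count() fixes it): hand parallelize an immutable copy of the buffer instead of the buffer that is about to be cleared.
{code}
import scala.collection.mutable.ListBuffer

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("union-repro").getOrCreate()
val sc = spark.sparkContext

val src = sc.parallelize(Seq.fill(3000)(new Array[String](10)))

val buffer = ListBuffer.empty[Array[String]]
var chunks = List.empty[RDD[Array[String]]]

for (row <- src.toLocalIterator) {
  buffer += row
  if (buffer.size == 1000) {
    // Copy before clearing: the RDD may not read the collection until an action runs.
    chunks ::= sc.parallelize(buffer.toList)
    buffer.clear()
  }
}

val rebuilt = sc.union(sc.parallelize(buffer.toList) :: chunks)
println(rebuilt.count())   // expected: 3000
{code}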
[jira] [Updated] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meng xi updated SPARK-21025: Description: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. was: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. 
Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meng xi updated SPARK-21025: Description: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. was: we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. 
Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043379#comment-16043379 ] meng xi commented on SPARK-21025: - No, I only commented out one line; the system incorrectly formatted my code this way... Okay, let me explain a little bit about our code logic: we would like to do a "carry forward" data cleansing, which uses the previous data point to fill in missing fields in the current data. After scanning the whole RDD, we reconstruct the RDD. This snippet just clones the original one, but if you run it, the result RDD is empty > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
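For readers skimming the thread, here is a minimal, hypothetical sketch of the "carry forward" cleansing described in the comment above, assuming rows are String[] as in the reported snippet and that null marks a missing field; the method name fillForward and those conventions are illustrative, not taken from the reporter's actual code.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CarryForward {
    // Fill each missing (null) field with the value last seen in an earlier row.
    // Fields that are missing in the very first row simply stay null.
    static List<String[]> fillForward(Iterator<String[]> rows, int width) {
        List<String[]> out = new ArrayList<>();
        String[] previous = new String[width];
        while (rows.hasNext()) {
            String[] row = rows.next().clone();  // work on a copy of the source row
            for (int i = 0; i < width; i++) {
                if (row[i] == null) {
                    row[i] = previous[i];        // carry the last seen value forward
                }
            }
            previous = row;
            out.add(row);
        }
        return out;
    }
}
{code}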
[jira] [Closed] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Woodbury closed SPARK-21022. -- Resolution: Invalid Wasn't actually a bug - `foreach` _doesn't_ actually swallow exceptions. > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043368#comment-16043368 ] Colin Woodbury commented on SPARK-21022: Ah ok, that makes sense for `foreachPartition`. And wouldn't you know, I retried my tests with `foreach`, and they _do_ throw now. I swear they weren't this morning :S Anyway, it looks like this isn't a bug after all. Thanks for the confirmation. > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21024) CSV parse mode handles Univocity parser exceptions
[ https://issues.apache.org/jira/browse/SPARK-21024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043310#comment-16043310 ] Xiao Li commented on SPARK-21024: - Yes! We should fix it. Thanks! > CSV parse mode handles Univocity parser exceptions > -- > > Key: SPARK-21024 > URL: https://issues.apache.org/jira/browse/SPARK-21024 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current master cannot skip the illegal records that Univocity parsers: > This comes from the spark-user mailing list: > https://www.mail-archive.com/user@spark.apache.org/msg63985.html > {code} > scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data") > scala> val df = spark.read.format("csv").schema("a int, b > int").option("maxColumns", "3").load("/Users/maropu/Desktop/data") > scala> df.show > com.univocity.parsers.common.TextParsingException: > java.lang.ArrayIndexOutOfBoundsException - 3 > Hint: Number of columns processed may have exceeded limit of 3 columns. Use > settings.setMaxColumns(int) to define the maximum number of columns your > input can have > Ensure your configuration is correct, with delimiters, quotes and escape > sequences that match the input format you are trying to parse > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Empty value=null > Escape unquoted values=false > ... > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195) > at > com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > ... > {code} > We could easily fix this like: > https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-21022: - Comment: was deleted (was: ~~Good catch...~~) > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043266#comment-16043266 ] Shixiong Zhu commented on SPARK-21022: -- Wait. I also checked `foreach` method. It does throw the exception. It's probably just you missed the exception due to lots of logs output? > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21025. --- Resolution: Invalid Oh, and I realized the mistake here. You put the body of the loop on one line and attempted to comment out just one statement, but you commented out all of the following statements as well, including the one that updates resultBuffer. > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
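To make the resolution concrete, below is the reporter's loop re-formatted with one statement per line, so the // marker removes only the rdd.count() call and nothing after it; this is a sketch based on the snippet quoted above, with the element type assumed to be String[].

{code:java}
while (it.hasNext()) {
    resultBuffer.add(it.next());
    if (resultBuffer.size() == 1000) {
        JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
        // rdd.count();       // only this call is commented out now
        rddList.add(rdd);     // still executed for every full batch
        resultBuffer.clear(); // still executed, so the buffer is reset
    }
}
{code}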
[jira] [Resolved] (SPARK-20976) Unify Error Messages for FAILFAST mode.
[ https://issues.apache.org/jira/browse/SPARK-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-20976. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18196 [https://github.com/apache/spark/pull/18196] > Unify Error Messages for FAILFAST mode. > > > Key: SPARK-20976 > URL: https://issues.apache.org/jira/browse/SPARK-20976 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Previously, we indicate the job was terminated because of `FAILFAST` mode. > {noformat} > Malformed line in FAILFAST mode: {"a":{, b:3} > {noformat} > If possible, we should keep it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043223#comment-16043223 ] Shixiong Zhu edited comment on SPARK-21022 at 6/8/17 7:05 PM: -- ~~Good catch...~~ was (Author: zsxwing): Good catch... > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043247#comment-16043247 ] Shixiong Zhu commented on SPARK-21022: -- By the way, `foreachPartition` doesn't have the issue. It's just because "Iterator.map" is lazy and you don't consume the Iterator. > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
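The laziness Shixiong describes has a direct analogue outside Spark. The following is a hedged Java illustration (not the reporter's Scala code) of why a lazily mapped pipeline never runs its body, and therefore never throws, until something actually consumes it.

{code:java}
import java.util.stream.IntStream;

public class LazyMapDemo {
    public static void main(String[] args) {
        // Nothing prints and nothing throws: mapToObj is lazy and no terminal
        // operation ever consumes the stream.
        IntStream.range(0, 10).mapToObj(i -> {
            System.out.println("BEFORE THROW");
            throw new RuntimeException("Testing exception handling");
        });

        // Adding a terminal operation (forEach) forces evaluation, so the
        // first element prints "BEFORE THROW" and the exception is thrown.
        IntStream.range(0, 10).mapToObj(i -> {
            System.out.println("BEFORE THROW");
            throw new RuntimeException("Testing exception handling");
        }).forEach(System.out::println);
    }
}
{code}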
[jira] [Commented] (SPARK-21025) missing data in jsc.union
[ https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043229#comment-16043229 ] Sean Owen commented on SPARK-21025: --- It's not clear why you're parallelizing 'src' to begin with, or why this is a simple reproduction. What are the values and sizes of all the intermediate structures? something else is going wrong here. > missing data in jsc.union > - > > Key: SPARK-21025 > URL: https://issues.apache.org/jira/browse/SPARK-21025 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0, 2.1.1 > Environment: Ubuntu 16.04 >Reporter: meng xi > > we are using an iterator of RDD for some special data processing, and then > using union to rebuild a new RDD. we found the result RDD are often empty or > missing most of the data. Here is a simplified code snippet for this bug: > SparkConf sparkConf = new > SparkConf().setAppName("Test").setMaster("local[*]"); > SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); > JavaSparkContext jsc = > JavaSparkContext.fromSparkContext(sparkContext); > JavaRDD src = jsc.parallelize(IntStream.range(0, > 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); > Iterator it = src.toLocalIterator(); > List> rddList = new LinkedList<>(); > List resultBuffer = new LinkedList<>(); > while (it.hasNext()) { > resultBuffer.add(it.next()); > if (resultBuffer.size() == 1000) { > JavaRDD rdd = jsc.parallelize(resultBuffer); > //rdd.count(); > rddList.add(rdd); > resultBuffer.clear(); > } > } > JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), > rddList); > System.out.println(desc.count()); > this code should duplicate the original RDD, but it just returns an empty > RDD. Please note that if I uncomment the rdd.count, it will return the > correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
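One documented behavior that may also be relevant to the reported symptom: the SparkContext.parallelize scaladoc notes that parallelize acts lazily, so if the input is a mutable collection that is altered after the call and before the first action, the resulting RDD reflects the modified collection, and it recommends passing a copy. In the reported snippet, resultBuffer.clear() runs before any action touches the per-batch RDDs. The sketch below (hypothetical class and variable names, not the reporter's code) shows the effect and the defensive-copy workaround.

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelizeLazinessDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ParallelizeLazinessDemo").setMaster("local[*]");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            List<Integer> buffer = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                buffer.add(i);
            }

            // No defensive copy: the RDD keeps a reference to 'buffer'.
            JavaRDD<Integer> direct = jsc.parallelize(buffer);
            // Defensive copy: the RDD gets its own snapshot of the data.
            JavaRDD<Integer> copied = jsc.parallelize(new ArrayList<>(buffer));

            buffer.clear(); // mutate the source list before any action runs

            System.out.println(direct.count()); // may print 0, reflecting the cleared list
            System.out.println(copied.count()); // prints 1000
        }
    }
}
{code}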
[jira] [Assigned] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-21022: Assignee: Shixiong Zhu > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Assignee: Shixiong Zhu >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043223#comment-16043223 ] Shixiong Zhu commented on SPARK-21022: -- Good catch... > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043209#comment-16043209 ] Marcelo Vanzin commented on SPARK-21023: Then your best bet is a new command line option that implements the behavior you want. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043191#comment-16043191 ] Lantao Jin commented on SPARK-21023: I think {{--conf}} couldn't help this. Because from the view of infra team, they hope their cluster level configuration can take effect in all jobs if no customer overwrite it. Does it make sense if we add a switch val in spark-env.sh? > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043179#comment-16043179 ] Lantao Jin commented on SPARK-21023: {quote} it may break existing applications {quote} I do understand the risk and want to do the right thing. We need to find a way that keeps the current behavior by default while still making it easy to switch to the behavior we want. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043176#comment-16043176 ] Marcelo Vanzin commented on SPARK-21023: bq. When and where the new config option be set? That's what makes that option awkward. It would have to be set in the user config or in the command line with {{\-\-conf}}. So it's not that much different from a new command line option, other than it avoids a new command line option. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043172#comment-16043172 ] Lantao Jin commented on SPARK-21023: {quote} Another option is to have a config option {quote} Oh, sorry. {{--properties-file}} causes the default configuration file to be skipped. When and where would the new config option be set? In spark-env? > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043168#comment-16043168 ] Marcelo Vanzin commented on SPARK-21023: bq. The purpose is making the default configuration loaded anytime. We all understand the purpose. But it breaks the existing behavior, so it may break existing applications. That makes your solution, as presented, a non-starter. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21025) missing data in jsc.union
meng xi created SPARK-21025: --- Summary: missing data in jsc.union Key: SPARK-21025 URL: https://issues.apache.org/jira/browse/SPARK-21025 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.1, 2.1.0 Environment: Ubuntu 16.04 Reporter: meng xi we are using an iterator of RDD for some special data processing, and then using union to rebuild a new RDD. we found the result RDD are often empty or missing most of the data. Here is a simplified code snippet for this bug: SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]"); SparkContext sparkContext = SparkContext.getOrCreate(sparkConf); JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext); JavaRDD src = jsc.parallelize(IntStream.range(0, 3000).mapToObj(i -> new String[10]).collect(Collectors.toList())); Iterator it = src.toLocalIterator(); List> rddList = new LinkedList<>(); List resultBuffer = new LinkedList<>(); while (it.hasNext()) { resultBuffer.add(it.next()); if (resultBuffer.size() == 1000) { JavaRDD rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear(); } } JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), rddList); System.out.println(desc.count()); this code should duplicate the original RDD, but it just returns an empty RDD. Please note that if I uncomment the rdd.count, it will return the correct result. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043164#comment-16043164 ] Lantao Jin commented on SPARK-21023: {quote} Another option is to have a config option {quote} LGTM > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043160#comment-16043160 ] Lantao Jin commented on SPARK-21023: *The purpose is to make the default configuration load in all cases.* The set of parameters an application developer specifies is always smaller than what the job really needs. For example: an app developer sets spark.executor.instances=100 in their properties file. A month later the infra team upgrades Spark to a new version and enables dynamic resource allocation, but the old job never picks up the new defaults, so dynamic allocation is not enabled for it. That makes the cluster harder for the infra team to control and hurts performance for the app team. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043160#comment-16043160 ] Lantao Jin edited comment on SPARK-21023 at 6/8/17 6:06 PM: The purpose is making the default configuration loaded anytime. Because the parameters app developer set always less the it should be. For example: App dev set spark.executor.instances=100 in their properties file. But one month later the spark version upgrade to a new version by infra team and dynamic resource allocation enabled. But the old job can not load the new parameters so no dynamic feature enable for it. It still causes more challenge to control cluster for infra team and bad performance for app team. was (Author: cltlfcjin): *The purpose is making the default configuration loaded anytime.* Because the parameters app developer set always less the it should be. For example: App dev set spark.executor.instances=100 in their properties file. But one month later the spark version upgrade to a new version by infra team and dynamic resource allocation enabled. But the old job can not load the new parameters so no dynamic feature enable for it. It still causes more challenge to control cluster for infra team and bad performance for app team. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20971) Purge the metadata log for FileStreamSource
[ https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043150#comment-16043150 ] Shixiong Zhu commented on SPARK-20971: -- FileStreamSource saves the seen files on disk/HDFS; we can purge the file entries in a way similar to org.apache.spark.sql.execution.streaming.FileStreamSource.SeenFilesMap. > Purge the metadata log for FileStreamSource > --- > > Key: SPARK-20971 > URL: https://issues.apache.org/jira/browse/SPARK-20971 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.1 >Reporter: Shixiong Zhu > > Currently > [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258] > is empty. We can delete unused metadata logs in this method to reduce the > size of log files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
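For readers unfamiliar with SeenFilesMap, the sketch below illustrates the age-based purge idea being referred to; the class, field and method names here are invented for illustration and do not match Spark's internal API.
{code}
import scala.collection.mutable

// Simplified, hypothetical sketch of an age-based purge, loosely modelled on the
// idea behind FileStreamSource.SeenFilesMap; not Spark's actual implementation.
class SeenEntriesMap(maxAgeMs: Long) {
  private val entries = mutable.HashMap.empty[String, Long] // path -> timestamp
  private var latestTimestamp = 0L

  def add(path: String, timestamp: Long): Unit = {
    entries.put(path, timestamp)
    latestTimestamp = math.max(latestTimestamp, timestamp)
  }

  /** Drop entries that are too old to ever be reported as new files again. */
  def purge(): Unit = {
    val threshold = latestTimestamp - maxAgeMs
    val expired = entries.collect { case (path, ts) if ts < threshold => path }.toList
    expired.foreach(entries.remove)
  }

  def isNewFile(path: String, timestamp: Long): Boolean =
    timestamp >= latestTimestamp - maxAgeMs && !entries.contains(path)
}
{code}
Presumably something analogous could be applied inside FileStreamSource.commit to trim metadata log entries that can no longer affect processing.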
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043132#comment-16043132 ] Marcelo Vanzin commented on SPARK-21023: Another option is to have a config option that controls whether the default file is loaded on top of {{--properties-file}}. It avoids adding a new command line argument, but is a little more awkward to use. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043131#comment-16043131 ] Marcelo Vanzin commented on SPARK-21023: bq. I suggest to change the current behavior Yes, and we're saying that should not be done, because it's a change in semantics that might cause breakages in people's workflows. Regardless of whether the new behavior is better or worse, implementing it is a breaking change. If you want this you need to implement it in a way that does not change the current behavior - e.g., as a new command line argument instead of modifying the behavior of the existing one. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043124#comment-16043124 ] Lantao Jin commented on SPARK-21023: [~vanzin] I suggest changing the current behavior and documenting it: --properties-file would override the values that are first loaded from spark-defaults.conf. That is equivalent to setting dozens of {{--conf k=v}} options on the command line. Please review; I'm open to any ideas. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043124#comment-16043124 ] Lantao Jin edited comment on SPARK-21023 at 6/8/17 5:45 PM: [~vanzin] I suggest changing the current behavior and documenting it: \-\-properties-file would override the values that are first loaded from spark-defaults.conf. That is equivalent to setting dozens of {{--conf k=v}} options on the command line. Please review; I'm open to any ideas. was (Author: cltlfcjin): [~vanzin] I suggest changing the current behavior and documenting it: --properties-file would override the values that are first loaded from spark-defaults.conf. That is equivalent to setting dozens of {{--conf k=v}} options on the command line. Please review; I'm open to any ideas. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21023: Assignee: Apache Spark > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043097#comment-16043097 ] Apache Spark commented on SPARK-21023: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/18243 > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21023: Assignee: (was: Apache Spark) > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043053#comment-16043053 ] Marcelo Vanzin edited comment on SPARK-21023 at 6/8/17 5:11 PM: I thought we had an issue for adding a user-specific config file that is loaded on top of the defaults, but I can't find it. In any case, changing the current behavior is not really desired, but you can add this as a new feature without changing the current behavior. was (Author: vanzin): I thought we have an issue for adding a user-specific config file that is loaded on top of the defaults, but I can't find it. In any case, changing the current behavior is not really desired, but you can add this as a new feature without changing the current behavior. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043053#comment-16043053 ] Marcelo Vanzin commented on SPARK-21023: I thought we have an issue for adding a user-specific config file that is loaded on top of the defaults, but I can't find it. In any case, changing the current behavior is not really desired, but you can add this as a new feature without changing the current behavior. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
[ https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043036#comment-16043036 ] Sean Owen commented on SPARK-21023: --- Maybe, but it would be a behavior change now. There are equal counter-arguments for the current behavior. > Ignore to load default properties file is not a good choice from the > perspective of system > -- > > Key: SPARK-21023 > URL: https://issues.apache.org/jira/browse/SPARK-21023 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.1 >Reporter: Lantao Jin >Priority: Minor > > The default properties file {{spark-defaults.conf}} shouldn't be ignore to > load even though the submit arg {{--properties-file}} is set. The reasons are > very easy to see: > * Infrastructure team need continually update the {{spark-defaults.conf}} > when they want set something as default for entire cluster as a tuning > purpose. > * Application developer only want to override the parameters they really want > rather than others they even doesn't know (Set by infrastructure team). > * The purpose of using {{\-\-properties-file}} from most of application > developers is to avoid setting dozens of {{--conf k=v}}. But if > {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally. > All this caused by below codes: > {code} > private Properties loadPropertiesFile() throws IOException { > Properties props = new Properties(); > File propsFile; > if (propertiesFile != null) { > // default conf property file will not be loaded when app developer use > --properties-file as a submit args > propsFile = new File(propertiesFile); > checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", > propertiesFile); > } else { > propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); > } > //... > return props; > } > {code} > I can offer a patch to fix it if you think it make sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043030#comment-16043030 ] Marcelo Vanzin commented on SPARK-19185: I merged Mark's patch above to master and branch-2.2, but it's just a work-around, not a fix, so I'll leave the bug open (and with no "fix version"). > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau > Labels: streaming, windowing > > We've been running into ConcurrentModificationExcpetions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. > Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org
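As a rough illustration of how the contention described above could be avoided (this is not the work-around that was merged, and the class and method names are invented), a cached consumer can be marked as in use so that a second task thread hitting the same topic and partition falls back to its own short-lived consumer instead of sharing one:
{code}
import scala.collection.mutable
import org.apache.kafka.clients.consumer.KafkaConsumer

// Illustrative sketch only: keep two task threads from sharing one cached
// KafkaConsumer by handing a throw-away consumer to the second thread.
class ConsumerCache[K, V](newConsumer: () => KafkaConsumer[K, V]) {
  private case class Entry(consumer: KafkaConsumer[K, V], var inUse: Boolean)
  private val cache = mutable.HashMap.empty[(String, Int), Entry]

  def acquire(topic: String, partition: Int): KafkaConsumer[K, V] = synchronized {
    val entry = cache.getOrElseUpdate((topic, partition), Entry(newConsumer(), inUse = false))
    if (entry.inUse) {
      newConsumer() // contention: this task gets its own, uncached consumer
    } else {
      entry.inUse = true
      entry.consumer
    }
  }

  def release(topic: String, partition: Int, consumer: KafkaConsumer[K, V]): Unit = synchronized {
    cache.get((topic, partition)) match {
      case Some(entry) if entry.consumer eq consumer => entry.inUse = false
      case _ => consumer.close() // it was a throw-away consumer; just close it
    }
  }
}
{code}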
[jira] [Comment Edited] (SPARK-21001) Staging folders from Hive table are not being cleared.
[ https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042981#comment-16042981 ] Ajay Cherukuri edited comment on SPARK-21001 at 6/8/17 4:51 PM: Hi Liang, do you mean 2.0.3? was (Author: ajaycherukuri): do you mean 2.0.3? > Staging folders from Hive table are not being cleared. > -- > > Key: SPARK-21001 > URL: https://issues.apache.org/jira/browse/SPARK-21001 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Ajay Cherukuri > > Staging folders that were being created as a part of Data loading to Hive > table by using spark job, are not cleared. > Staging folder are remaining in Hive External table folders even after Spark > job is completed. > This is the same issue mentioned in the > ticket:https://issues.apache.org/jira/browse/SPARK-18372 > This ticket says the issues was resolved in 1.6.4. But, now i found that it's > still existing on 2.0.2. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21013) Spark History Server does not show the logs of completed Yarn Jobs
[ https://issues.apache.org/jira/browse/SPARK-21013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-21013. Resolution: Duplicate You need the MR history server for aggregated logs to show. This is already explained in Spark's documentation. > Spark History Server does not show the logs of completed Yarn Jobs > -- > > Key: SPARK-21013 > URL: https://issues.apache.org/jira/browse/SPARK-21013 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.1, 2.0.1, 2.1.0 >Reporter: Hari Ck >Priority: Minor > Labels: historyserver, ui > > I am facing issue when accessing the container logs of a completed Spark > (Yarn) application from the History Server. > Repro Steps: > 1) Run the spark-shell in yarn client mode. Or run Pi job in Yarn mode. > 2) Once the job is completed, (in the case of spark shell, exit after doing > some simple operations), try to access the STDOUT or STDERR logs of the > application from the Executors tab in the Spark History Server UI. > 3) If yarn log aggregation is enabled, then logs won't be available in node > manager's log location. But history Server is trying to access the logs from > the nodemanager's log location giving below error in the UI: > Failed redirect for container_e31_1496881617682_0003_01_02 > ResourceManager > RM Home > NodeManager > Tools > Failed while trying to construct the redirect url to the log server. Log > Server url may not be configured > java.lang.Exception: Unknown container. Container either has not started or > has already completed or doesn't belong to this node at all. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21001) Staging folders from Hive table are not being cleared.
[ https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042981#comment-16042981 ] Ajay Cherukuri commented on SPARK-21001: do you mean 2.0.3? > Staging folders from Hive table are not being cleared. > -- > > Key: SPARK-21001 > URL: https://issues.apache.org/jira/browse/SPARK-21001 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Ajay Cherukuri > > Staging folders that were being created as a part of Data loading to Hive > table by using spark job, are not cleared. > Staging folder are remaining in Hive External table folders even after Spark > job is completed. > This is the same issue mentioned in the > ticket:https://issues.apache.org/jira/browse/SPARK-18372 > This ticket says the issues was resolved in 1.6.4. But, now i found that it's > still existing on 2.0.2. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21024) CSV parse mode handles Univocity parser exceptions
[ https://issues.apache.org/jira/browse/SPARK-21024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042925#comment-16042925 ] Takeshi Yamamuro commented on SPARK-21024: -- Is it worth fixing this? (I feel it is a kind of corner case.) cc: [~smilegator] [~hyukjin.kwon] > CSV parse mode handles Univocity parser exceptions > -- > > Key: SPARK-21024 > URL: https://issues.apache.org/jira/browse/SPARK-21024 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > The current master cannot skip the illegal records that Univocity parsers: > This comes from the spark-user mailing list: > https://www.mail-archive.com/user@spark.apache.org/msg63985.html > {code} > scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data") > scala> val df = spark.read.format("csv").schema("a int, b > int").option("maxColumns", "3").load("/Users/maropu/Desktop/data") > scala> df.show > com.univocity.parsers.common.TextParsingException: > java.lang.ArrayIndexOutOfBoundsException - 3 > Hint: Number of columns processed may have exceeded limit of 3 columns. Use > settings.setMaxColumns(int) to define the maximum number of columns your > input can have > Ensure your configuration is correct, with delimiters, quotes and escape > sequences that match the input format you are trying to parse > Parser Configuration: CsvParserSettings: > Auto configuration enabled=true > Autodetect column delimiter=false > Autodetect quotes=false > Column reordering enabled=true > Empty value=null > Escape unquoted values=false > ... > at > com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) > at > com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195) > at > com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308) > at > org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312) > at > org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > ... > {code} > We could easily fix this like: > https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21024) CSV parse mode handles Univocity parser exceptions
Takeshi Yamamuro created SPARK-21024: Summary: CSV parse mode handles Univocity parser exceptions Key: SPARK-21024 URL: https://issues.apache.org/jira/browse/SPARK-21024 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.1 Reporter: Takeshi Yamamuro Priority: Minor The current master cannot skip illegal records for which the Univocity parser itself throws an exception: This comes from the spark-user mailing list: https://www.mail-archive.com/user@spark.apache.org/msg63985.html {code} scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data") scala> val df = spark.read.format("csv").schema("a int, b int").option("maxColumns", "3").load("/Users/maropu/Desktop/data") scala> df.show com.univocity.parsers.common.TextParsingException: java.lang.ArrayIndexOutOfBoundsException - 3 Hint: Number of columns processed may have exceeded limit of 3 columns. Use settings.setMaxColumns(int) to define the maximum number of columns your input can have Ensure your configuration is correct, with delimiters, quotes and escape sequences that match the input format you are trying to parse Parser Configuration: CsvParserSettings: Auto configuration enabled=true Autodetect column delimiter=false Autodetect quotes=false Column reordering enabled=true Empty value=null Escape unquoted values=false ... at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339) at com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195) at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308) at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312) at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) ... {code} We could easily fix this along the lines of: https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
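The direction linked above amounts to treating a Univocity TextParsingException like any other malformed record, so that the configured parse mode decides the outcome. The sketch below only illustrates that idea under assumed names (ParseMode, safeParse, toNullRow); it is not the code in the linked branch or Spark's internal API.
{code}
import com.univocity.parsers.common.TextParsingException

// Hypothetical sketch: route Univocity's own parsing exceptions through the
// configured CSV parse mode instead of letting them escape to the caller.
sealed trait ParseMode
case object Permissive extends ParseMode
case object DropMalformed extends ParseMode
case object FailFast extends ParseMode

def safeParse[T](line: String, mode: ParseMode)(doParse: String => T)(toNullRow: String => T): Option[T] =
  try {
    Some(doParse(line))
  } catch {
    case e: TextParsingException => mode match {
      case Permissive    => Some(toNullRow(line)) // keep the row, null out the unparsable fields
      case DropMalformed => None                  // silently drop the record
      case FailFast      => throw new RuntimeException(s"Malformed CSV record: $line", e)
    }
  }
{code}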
[jira] [Created] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system
Lantao Jin created SPARK-21023: -- Summary: Ignore to load default properties file is not a good choice from the perspective of system Key: SPARK-21023 URL: https://issues.apache.org/jira/browse/SPARK-21023 Project: Spark Issue Type: Improvement Components: Spark Submit Affects Versions: 2.1.1 Reporter: Lantao Jin Priority: Minor The default properties file {{spark-defaults.conf}} should not be skipped even when the submit argument {{--properties-file}} is set. The reasons are easy to see: * The infrastructure team needs to continually update {{spark-defaults.conf}} when they want to set cluster-wide defaults for tuning purposes. * Application developers only want to override the parameters they actually care about, not ones they may not even know about (set by the infrastructure team). * For most application developers, the point of {{\-\-properties-file}} is to avoid passing dozens of {{--conf k=v}} options. But if {{spark-defaults.conf}} is ignored, the resulting behaviour is unexpected. All of this is caused by the code below: {code} private Properties loadPropertiesFile() throws IOException { Properties props = new Properties(); File propsFile; if (propertiesFile != null) { // the default conf properties file is not loaded when the app developer passes --properties-file as a submit arg propsFile = new File(propertiesFile); checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", propertiesFile); } else { propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE); } //... return props; } {code} I can offer a patch to fix it if you think it makes sense. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21022) RDD.foreach swallows exceptions
[ https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Colin Woodbury updated SPARK-21022: --- Summary: RDD.foreach swallows exceptions (was: foreach swallows exceptions) > RDD.foreach swallows exceptions > --- > > Key: SPARK-21022 > URL: https://issues.apache.org/jira/browse/SPARK-21022 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Colin Woodbury >Priority: Minor > > A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown > inside its closure, but not if the exception was thrown earlier in the call > chain. An example: > {code:none} > package examples > import org.apache.spark._ > object Shpark { >def main(args: Array[String]) { > implicit val sc: SparkContext = new SparkContext( >new SparkConf().setMaster("local[*]").setAppName("blahfoobar") > ) > /* DOESN'T THROW > > sc.parallelize(0 until 1000) > >.foreachPartition { _.map { i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}} > > */ > /* DOESN'T THROW, nor does anything print. > > * Commenting out the exception runs the prints. > > * (i.e. `foreach` is sufficient to "run" an RDD) > > sc.parallelize(0 until 10) > >.foreach({ i => > > println("BEFORE THROW") > > throw new Exception("Testing exception handling") > > println(i) > >}) > > */ > /* Throws! */ > sc.parallelize(0 until 10) >.map({ i => > println("BEFORE THROW") > throw new Exception("Testing exception handling") > i >}) >.foreach(i => println(i)) > println("JOB DONE!") > System.in.read > sc.stop() >} > } > {code} > When exceptions are swallowed, the jobs don't seem to fail, and the driver > exits normally. When one _is_ thrown, as in the last example, the exception > successfully rises up to the driver and can be caught with try/catch. > The expected behaviour is for exceptions in `foreach` to throw and crash the > driver, as they would with `map`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
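As a stop-gap for the behaviour reported above, failures inside the closure can be counted with an accumulator and surfaced explicitly on the driver. This is only a user-side workaround sketch, not a fix, and the helper name foreachOrFail is made up:
{code}
import scala.reflect.ClassTag
import scala.util.{Failure, Try}
import org.apache.spark.SparkContext

// Hedged workaround sketch: count closure failures with an accumulator and
// fail loudly on the driver if any task-side invocation threw.
def foreachOrFail[T: ClassTag](sc: SparkContext, data: Seq[T])(f: T => Unit): Unit = {
  val failures = sc.longAccumulator("closure failures")
  sc.parallelize(data).foreach { x =>
    Try(f(x)) match {
      case Failure(e) =>
        failures.add(1L)
        System.err.println(s"Task-side failure for $x: ${e.getMessage}")
      case _ => // success, nothing to record
    }
  }
  if (failures.value > 0) {
    throw new RuntimeException(s"${failures.value} closure invocation(s) failed; see executor logs")
  }
}
{code}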