[jira] [Updated] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

2017-06-08 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-21031:
-
Description: 
Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also
persisted through `CatalogStatistics`. Therefore, given a `CatalogStatistics`, we cannot
tell whether its stats come from Hive or from Spark. As a result, Hive's stats can be
unexpectedly propagated into Spark's stats.

For example, for a catalog table, we read stats from Hive (e.g. "totalSize") and put
them into `CatalogStatistics`. Then, when an "ALTER TABLE" command is run, we store the
stats in `CatalogStatistics` into the metastore as Spark's stats (because we cannot tell
whether they came from Spark or not). But Spark's stats should only be generated by the
"ANALYZE" command, so this behavior is unexpected for "ALTER TABLE".

Secondly, now that Spark's stats (which should not exist) are in the metastore, we still
cannot get the right `sizeInBytes` in `CatalogStatistics` after inserting new data, even
though Hive updates "totalSize" in the metastore, because we prefer Spark's stats over
Hive's stats.

{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds

spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes (-- This should be 8 bytes)
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.077 seconds, Fetched 19 row(s)
{code}
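
To make the intended separation concrete, here is a minimal, hypothetical Scala sketch
(not Spark's actual implementation) of resolving table-level stats so that
Hive-maintained values and Spark's ANALYZE-generated values stay distinguishable. It
assumes, purely for illustration, that Spark persists its own stats under table
properties with a "spark.sql.statistics." prefix, while Hive keeps its plain "totalSize"
parameter up to date on insert.

{code}
// Hypothetical sketch: keep Hive-derived stats and Spark-generated stats apart,
// so that ALTER TABLE never writes Hive's values back under Spark's keys.
case class ResolvedStats(sizeInBytes: BigInt, rowCount: Option[BigInt], fromSpark: Boolean)

object StatsResolver {
  // Assumed prefix for Spark-generated stats stored as table properties.
  val sparkPrefix = "spark.sql.statistics."

  def resolve(tableProperties: Map[String, String]): ResolvedStats = {
    val sparkSize = tableProperties.get(sparkPrefix + "totalSize").map(BigInt(_))
    val sparkRows = tableProperties.get(sparkPrefix + "numRows").map(BigInt(_))
    val hiveSize  = tableProperties.get("totalSize").map(BigInt(_))

    sparkSize match {
      // Spark stats exist only if ANALYZE wrote them; trust them in that case.
      case Some(size) => ResolvedStats(size, sparkRows, fromSpark = true)
      // Otherwise fall back to Hive's totalSize, which Hive refreshes on insert.
      case None => ResolvedStats(hiveSize.getOrElse(BigInt(0)), None, fromSpark = false)
    }
  }
}
{code}

With the two sources kept apart like this, the second "desc formatted" output above
would fall back to Hive's updated "totalSize" (8 bytes) instead of the stale 4-byte
value that "ALTER TABLE" persisted as Spark stats.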

  was:

Currently, hive's stats are read into `CatalogStatistics`, while spark's stats 
are also persisted through `CatalogStatistics`. Therefore, in 
`CatalogStatistics`, we cannot tell whether its stats is from hive or spark. As 
a result, hive's stats can be unexpectedly propagated into spark' stats.

For example, for a catalog table, we read stats from hive, e.g. "totalSize" and 
put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we will 
store the stats in `CatalogStatistics` into metastore as spark's stats (because 
we don't know whether it's from spark or not). But spark's stats should be only 
generated by "ANALYZE" command. This is unexpected from this command.

Secondly, now that we store wrong spark's stats, after inserting new data, 
although hive updated "totalSize" in metastore, we still cannot get the right 
`sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong 
stats) over hive's stats.

{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 sec

[jira] [Assigned] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

2017-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21031:


Assignee: (was: Apache Spark)

> Clearly separate hive stats and spark stats in catalog
> --
>
> Key: SPARK-21031
> URL: https://issues.apache.org/jira/browse/SPARK-21031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Currently, hive's stats are read into `CatalogStatistics`, while spark's 
> stats are also persisted through `CatalogStatistics`. Therefore, in 
> `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. 
> As a result, hive's stats can be unexpectedly propagated into spark' stats.
> For example, for a catalog table, we read stats from hive, e.g. "totalSize" 
> and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we 
> will store the stats in `CatalogStatistics` into metastore as spark's stats 
> (because we don't know whether it's from spark or not). But spark's stats 
> should be only generated by "ANALYZE" command. This is unexpected from this 
> command.
> Secondly, now that we store wrong spark's stats, after inserting new data, 
> although hive updated "totalSize" in metastore, we still cannot get the right 
> `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong 
> stats) over hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_namedata_type   comment
> i string  NULL
> j string  NULL
> # Detailed Table Information  
> Database  default 
> Table xx  
> Owner wzh 
> Created   Thu Jun 08 18:30:46 PDT 2017
> Last Access   Wed Dec 31 16:00:00 PST 1969
> Type  MANAGED 
> Provider  hive
> Properties[serialization.format=1]
> Statistics4 bytes 
> Location  file:/Users/wzh/Projects/spark/spark-warehouse/xx   
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
> Partition ProviderCatalog 
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_namedata_type   comment
> i string  NULL
> j string  NULL
> # Detailed Table Information  
> Database  default 
> Table xx  
> Owner wzh 
> Created   Thu Jun 08 18:30:46 PDT 2017
> Last Access   Wed Dec 31 16:00:00 PST 1969
> Type  MANAGED 
> Provider  hive
> Properties[serialization.format=1]
> Statistics4 bytes (-- This should be 8 bytes)
> Location  file:/Users/wzh/Projects/spark/spark-warehouse/xx   
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
> Partition ProviderCatalog 
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}






[jira] [Commented] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

2017-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044009#comment-16044009
 ] 

Apache Spark commented on SPARK-21031:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18248

> Clearly separate hive stats and spark stats in catalog
> --
>
> Key: SPARK-21031
> URL: https://issues.apache.org/jira/browse/SPARK-21031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> Currently, hive's stats are read into `CatalogStatistics`, while spark's 
> stats are also persisted through `CatalogStatistics`. Therefore, in 
> `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. 
> As a result, hive's stats can be unexpectedly propagated into spark' stats.
> For example, for a catalog table, we read stats from hive, e.g. "totalSize" 
> and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we 
> will store the stats in `CatalogStatistics` into metastore as spark's stats 
> (because we don't know whether it's from spark or not). But spark's stats 
> should be only generated by "ANALYZE" command. This is unexpected from this 
> command.
> Secondly, now that we store wrong spark's stats, after inserting new data, 
> although hive updated "totalSize" in metastore, we still cannot get the right 
> `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong 
> stats) over hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_namedata_type   comment
> i string  NULL
> j string  NULL
> # Detailed Table Information  
> Database  default 
> Table xx  
> Owner wzh 
> Created   Thu Jun 08 18:30:46 PDT 2017
> Last Access   Wed Dec 31 16:00:00 PST 1969
> Type  MANAGED 
> Provider  hive
> Properties[serialization.format=1]
> Statistics4 bytes 
> Location  file:/Users/wzh/Projects/spark/spark-warehouse/xx   
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
> Partition ProviderCatalog 
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_namedata_type   comment
> i string  NULL
> j string  NULL
> # Detailed Table Information  
> Database  default 
> Table xx  
> Owner wzh 
> Created   Thu Jun 08 18:30:46 PDT 2017
> Last Access   Wed Dec 31 16:00:00 PST 1969
> Type  MANAGED 
> Provider  hive
> Properties[serialization.format=1]
> Statistics4 bytes (-- This should be 8 bytes)
> Location  file:/Users/wzh/Projects/spark/spark-warehouse/xx   
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
> Partition ProviderCatalog 
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}






[jira] [Assigned] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

2017-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21031:


Assignee: Apache Spark

> Clearly separate hive stats and spark stats in catalog
> --
>
> Key: SPARK-21031
> URL: https://issues.apache.org/jira/browse/SPARK-21031
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> Currently, hive's stats are read into `CatalogStatistics`, while spark's 
> stats are also persisted through `CatalogStatistics`. Therefore, in 
> `CatalogStatistics`, we cannot tell whether its stats is from hive or spark. 
> As a result, hive's stats can be unexpectedly propagated into spark' stats.
> For example, for a catalog table, we read stats from hive, e.g. "totalSize" 
> and put it into `CatalogStatistics`. Then, by using "ALTER TABLE" command, we 
> will store the stats in `CatalogStatistics` into metastore as spark's stats 
> (because we don't know whether it's from spark or not). But spark's stats 
> should be only generated by "ANALYZE" command. This is unexpected from this 
> command.
> Secondly, now that we store wrong spark's stats, after inserting new data, 
> although hive updated "totalSize" in metastore, we still cannot get the right 
> `sizeInBytes` in `CatalogStatistics`, because we respect spark's stats (wrong 
> stats) over hive's stats.
> {code}
> spark-sql> create table xx(i string, j string);
> spark-sql> insert into table xx select 'a', 'b';
> spark-sql> desc formatted xx;
> # col_namedata_type   comment
> i string  NULL
> j string  NULL
> # Detailed Table Information  
> Database  default 
> Table xx  
> Owner wzh 
> Created   Thu Jun 08 18:30:46 PDT 2017
> Last Access   Wed Dec 31 16:00:00 PST 1969
> Type  MANAGED 
> Provider  hive
> Properties[serialization.format=1]
> Statistics4 bytes 
> Location  file:/Users/wzh/Projects/spark/spark-warehouse/xx   
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
> Partition ProviderCatalog 
> Time taken: 0.089 seconds, Fetched 19 row(s)
> spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
> Time taken: 0.187 seconds
> spark-sql> insert into table xx select 'c', 'd';
> Time taken: 0.583 seconds
> spark-sql> desc formatted xx;
> # col_namedata_type   comment
> i string  NULL
> j string  NULL
> # Detailed Table Information  
> Database  default 
> Table xx  
> Owner wzh 
> Created   Thu Jun 08 18:30:46 PDT 2017
> Last Access   Wed Dec 31 16:00:00 PST 1969
> Type  MANAGED 
> Provider  hive
> Properties[serialization.format=1]
> Statistics4 bytes (-- This should be 8 bytes)
> Location  file:/Users/wzh/Projects/spark/spark-warehouse/xx   
> Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
> InputFormat   org.apache.hadoop.mapred.TextInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
> Partition ProviderCatalog 
> Time taken: 0.077 seconds, Fetched 19 row(s)
> {code}






[jira] [Updated] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

2017-06-08 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-21031:
-
Description: 

Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also
persisted through `CatalogStatistics`. Therefore, given a `CatalogStatistics`, we cannot
tell whether its stats come from Hive or from Spark. As a result, Hive's stats can be
unexpectedly propagated into Spark's stats.

For example, for a catalog table, we read stats from Hive (e.g. "totalSize") and put
them into `CatalogStatistics`. Then, when an "ALTER TABLE" command is run, we store the
stats in `CatalogStatistics` into the metastore as Spark's stats (because we cannot tell
whether they came from Spark or not). But Spark's stats should only be generated by the
"ANALYZE" command, so this behavior is unexpected for "ALTER TABLE".

Secondly, now that we have stored wrong Spark stats, we still cannot get the right
`sizeInBytes` in `CatalogStatistics` after inserting new data, even though Hive updates
"totalSize" in the metastore, because we prefer Spark's (wrong) stats over Hive's stats.

{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds

spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes (-- This should be 8 bytes)
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.077 seconds, Fetched 19 row(s)
{code}

  was:
Currently, hive's stats are read into `CatalogStatistics`, while spark's stats 
are also persisted through `CatalogStatistics`. Therefore, in 
`CatalogStatistics`, we cannot tell whether its stats is from hive or spark. As 
a result, hive's stats can be unexpectedly propagated into spark' stats.

For example, by using "ALTER TABLE" command, we will store the stats info (read 
from hive, e.g. "totalSize") in `CatalogStatistics` into metastore as spark's 
stats (because we don't know whether it's from spark or not). But spark's stats 
should be only generated by "ANALYZE" command. This is unexpected from this 
command.

Besides, now that we store wrong spark's stats, after inserting new data, 
although hive updated "totalSize" in metastore, we still cannot get the right 
`sizeInBytes` in `CatalogStatistics`, because we respect the wrong spark stats 
over hive's stats.

{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds

spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds

sp

[jira] [Created] (SPARK-21031) Clearly separate hive stats and spark stats in catalog

2017-06-08 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-21031:


 Summary: Clearly separate hive stats and spark stats in catalog
 Key: SPARK-21031
 URL: https://issues.apache.org/jira/browse/SPARK-21031
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang


Currently, Hive's stats are read into `CatalogStatistics`, and Spark's stats are also
persisted through `CatalogStatistics`. Therefore, given a `CatalogStatistics`, we cannot
tell whether its stats come from Hive or from Spark. As a result, Hive's stats can be
unexpectedly propagated into Spark's stats.

For example, when an "ALTER TABLE" command is run, we store the stats in
`CatalogStatistics` (read from Hive, e.g. "totalSize") into the metastore as Spark's
stats (because we don't know whether they came from Spark or not). But Spark's stats
should only be generated by the "ANALYZE" command, so this behavior is unexpected for
"ALTER TABLE".

Besides, now that we have stored wrong Spark stats, we still cannot get the right
`sizeInBytes` in `CatalogStatistics` after inserting new data, even though Hive updates
"totalSize" in the metastore, because we prefer the wrong Spark stats over Hive's stats.

{code}
spark-sql> create table xx(i string, j string);
spark-sql> insert into table xx select 'a', 'b';

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.089 seconds, Fetched 19 row(s)

spark-sql> alter table xx set tblproperties ('prop1' = 'yy');
Time taken: 0.187 seconds

spark-sql> insert into table xx select 'c', 'd';
Time taken: 0.583 seconds

spark-sql> desc formatted xx;
# col_name  data_type   comment
i   string  NULL
j   string  NULL
# Detailed Table Information
Database            default
Table               xx
Owner               wzh
Created             Thu Jun 08 18:30:46 PDT 2017
Last Access         Wed Dec 31 16:00:00 PST 1969
Type                MANAGED
Provider            hive
Properties          [serialization.format=1]
Statistics          4 bytes (-- This should be 8 bytes)
Location            file:/Users/wzh/Projects/spark/spark-warehouse/xx
Serde Library       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat         org.apache.hadoop.mapred.TextInputFormat
OutputFormat        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider  Catalog 
Time taken: 0.077 seconds, Fetched 19 row(s)
{code}






[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043973#comment-16043973
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

^^

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive 

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043972#comment-16043972
 ] 

Wenchen Fan commented on SPARK-20954:
-

oh sorry I misclicked...

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive   

[jira] [Assigned] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20954:
---

Assignee: Dongjoon Hyun  (was: Liang-Chi Hsieh)

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Dongjoon Hyun
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive   

[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043948#comment-16043948
 ] 

Dongjoon Hyun commented on SPARK-20954:
---

Hi, Wenchen.
I'm Dongjoon. :)

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive

[jira] [Created] (SPARK-21030) extend hint syntax to support any expression for Python and R

2017-06-08 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-21030:


 Summary: extend hint syntax to support any expression for Python 
and R
 Key: SPARK-21030
 URL: https://issues.apache.org/jira/browse/SPARK-21030
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR, SQL
Affects Versions: 2.2.0
Reporter: Felix Cheung


See SPARK-20854. We need to relax the checks in
https://github.com/apache/spark/blob/6cbc61d1070584ffbc34b1f53df352c9162f414a/python/pyspark/sql/dataframe.py#L422

and
https://github.com/apache/spark/blob/7f203a248f94df6183a4bc4642a3d873171fef29/R/pkg/R/DataFrame.R#L3746
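
For reference, a minimal Scala sketch of the Dataset hint usage that the Python and R
wrappers would be relaxed to match ("someHint" is a made-up hint name used only to show
non-string parameters; in my understanding, unrecognized hint names are ignored rather
than failing the query):

{code}
import org.apache.spark.sql.SparkSession

object HintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("hint-sketch").getOrCreate()
    import spark.implicits._

    val small = Seq((1, "a"), (2, "b")).toDF("id", "v")
    val large = spark.range(0, 100000).toDF("id")

    // The Scala Dataset API takes a hint name plus additional parameters.
    val joined = large.join(small.hint("broadcast"), "id")

    // Hypothetical hint name, only to illustrate passing arbitrary parameter types.
    val hinted = large.hint("someHint", 100, "id")

    joined.explain()
    hinted.explain()
    spark.stop()
  }
}
{code}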







[jira] [Created] (SPARK-21029) All StreamingQuery should be stopped when the SparkSession is stopped

2017-06-08 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-21029:


 Summary: All StreamingQuery should be stopped when the 
SparkSession is stopped
 Key: SPARK-21029
 URL: https://issues.apache.org/jira/browse/SPARK-21029
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0, 2.3.0
Reporter: Felix Cheung









[jira] [Commented] (SPARK-20510) SparkR 2.2 QA: Update user guide for new features & APIs

2017-06-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043895#comment-16043895
 ] 

Felix Cheung commented on SPARK-20510:
--

credit to
SPARK-20208
SPARK-20849
SPARK-20477
SPARK-20478
SPARK-20258
SPARK-20026
SPARK-20015


> SparkR 2.2 QA: Update user guide for new features & APIs
> 
>
> Key: SPARK-20510
> URL: https://issues.apache.org/jira/browse/SPARK-20510
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> If you would like to work on this task, please comment, and we can create & 
> link JIRAs for parts of this work.






[jira] [Commented] (SPARK-20511) SparkR 2.2 QA: Check for new R APIs requiring example code

2017-06-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043894#comment-16043894
 ] 

Felix Cheung commented on SPARK-20511:
--

credit to
SPARK-20208
SPARK-20849
SPARK-20477
SPARK-20478
SPARK-20258
SPARK-20026
SPARK-20015


> SparkR 2.2 QA: Check for new R APIs requiring example code
> --
>
> Key: SPARK-20511
> URL: https://issues.apache.org/jira/browse/SPARK-20511
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>
> Audit list of new features added to MLlib's R API, and see which major items 
> are missing example code (in the examples folder).  We do not need examples 
> for everything, only for major items such as new algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").






[jira] [Commented] (SPARK-20513) Update SparkR website for 2.2

2017-06-08 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043892#comment-16043892
 ] 

Felix Cheung commented on SPARK-20513:
--

right, I don't think there's a site for R
https://github.com/apache/spark-website


> Update SparkR website for 2.2
> -
>
> Key: SPARK-20513
> URL: https://issues.apache.org/jira/browse/SPARK-20513
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Update the sub-project's website to include new features in this release.






[jira] [Assigned] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-20954:
---

Assignee: Liang-Chi Hsieh

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive
> 

[jira] [Resolved] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20954.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18245
[https://github.com/apache/spark/pull/18245]

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
> Fix For: 2.2.0
>
>
> I am trying to do DESCRIBE on a table but seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |" ; however, select and select count(*) only shows 1 
> row.
> I searched online a long time and do not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  
> | comment  |
> +---+-+--+--+
> | # col_name| data_type   
> | comment  |
> | c1| int 
> | NULL |
> |   | 
> |  |
> | # Detailed Table Information  | 
> |  |
> | Database  | garros  
> |  |
> | Table | hiveint 
> |  |
> | Owner | root
> |  |
> | Created   | Thu Feb 09 17:40:36 EST 2017
> |  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969
> |  |
> | Type  | MANAGED 
> |  |
> | Provider  | hive

[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-06-08 Thread Fei Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043877#comment-16043877
 ] 

Fei Shao commented on SPARK-20589:
--

Tasks are assigned to executors. If we set the number of executors to 5 and cap the
number of simultaneous tasks at 2, a contradiction occurs here.

So can we change the requirement to "allow limiting task concurrency per executor",
please?

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your spark job might be accessing another 
> service and you don't want to DOS that service.  For instance Spark writing 
> to hbase or Spark doing http puts on a service.  Many times you want to do 
> this without limiting the number of partitions. 
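
For context, the blunt workaround available today is to cap concurrency by capping
partitions, which is exactly what this issue wants to avoid. A hedged sketch (the
external service call is hypothetical):

{code}
import org.apache.spark.sql.SparkSession

// coalesce(8) means at most 8 concurrent tasks hit the service, but it also
// changes how the work itself is partitioned -- the issue asks for a per-stage
// concurrency limit that does not require repartitioning.
object ThrottledWrite {
  def putToService(record: String): Unit = {
    // hypothetical HTTP PUT to the external service
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("throttle-sketch").getOrCreate()
    val data = spark.sparkContext.parallelize(1 to 1000000, numSlices = 200)

    data.map(_.toString)
      .coalesce(8)          // caps concurrency at 8 tasks for this stage
      .foreach(putToService)

    spark.stop()
  }
}
{code}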






[jira] [Commented] (SPARK-21001) Staging folders from Hive table are not being cleared.

2017-06-08 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043869#comment-16043869
 ] 

Liang-Chi Hsieh commented on SPARK-21001:
-

No, I mean the current 2.0 branch in git. I think there's no 2.0.3 release yet.

> Staging folders from Hive table are not being cleared.
> --
>
> Key: SPARK-21001
> URL: https://issues.apache.org/jira/browse/SPARK-21001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Ajay Cherukuri
>
> Staging folders that were being created as a part of Data loading to Hive 
> table by using spark job, are not cleared.
> Staging folder are remaining in Hive External table folders even after Spark 
> job is completed.
> This is the same issue mentioned in the 
> ticket:https://issues.apache.org/jira/browse/SPARK-18372
> This ticket says the issues was resolved in 1.6.4. But, now i found that it's 
> still existing on 2.0.2.






[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714
 ] 

Robert Kruszewski edited comment on SPARK-20952 at 6/9/17 2:12 AM:
---

~~Right, but how do I pass it downstream?~~ So I would store it and restore it 
inside the threadpool? Now every spark contributor has to know about it but if 
that's preferred happy to modify.


was (Author: robert3005):
~Right, but how do I pass it downstream?~ So I would store it and restore it 
inside the threadpool? Now every spark contributor has to know about it but if 
that's preferred happy to modify.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.
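
A short Scala sketch of the workaround implied above under the current ThreadLocal
behavior: capture TaskContext.get() on the task's own thread and close over it, instead
of calling TaskContext.get() from the pool threads (where it currently returns null).

{code}
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.TaskContext

// Intended to run inside a task, e.g. rdd.mapPartitions(CapturedTaskContext.processPartition).
object CapturedTaskContext {
  def processPartition(rows: Iterator[String]): Iterator[String] = {
    val ctx  = TaskContext.get()   // captured on the task's thread
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = rows.toVector.map { row =>
        Future {
          // Use the captured reference; TaskContext.get() here would be null
          // because TaskContext is a plain ThreadLocal today.
          s"partition=${ctx.partitionId()} row=$row"
        }
      }
      futures.map(Await.result(_, 1.minute)).iterator
    } finally {
      pool.shutdown()
    }
  }
}
{code}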






[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714
 ] 

Robert Kruszewski edited comment on SPARK-20952 at 6/9/17 2:13 AM:
---

-Right, but how do I pass it downstream?- So I would store it and restore it 
inside the threadpool? Now every spark contributor has to know about it but if 
that's preferred happy to modify.


was (Author: robert3005):
~~Right, but how do I pass it downstream?~~ So I would store it and restore it 
inside the threadpool? Now every spark contributor has to know about it but if 
that's preferred happy to modify.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.






[jira] [Commented] (SPARK-18075) UDF doesn't work on non-local spark

2017-06-08 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043778#comment-16043778
 ] 

Wenchen Fan commented on SPARK-18075:
-

For development/testing, you can specify the Spark master as {{local-cluster[4, 8,
2048]}}, which simulates a 4-node Spark cluster with 8 cores and 2048 MB of RAM per
node.
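
A minimal sketch of that suggestion ({{local-cluster}} is intended for Spark
development/testing and assumes a local Spark build is available):

{code}
import org.apache.spark.sql.SparkSession

// local-cluster[4, 8, 2048] simulates 4 executor processes, each with 8 cores
// and 2048 MB of memory, so serialization problems that "local" mode hides
// (like the UDF issue reported here) can be reproduced.
object LocalClusterUdfTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-local-cluster-test")
      .master("local-cluster[4, 8, 2048]")
      .getOrCreate()

    spark.udf.register("func", (name: String) => name.toUpperCase)
    spark.range(3).selectExpr("func(cast(id as string))").show()

    spark.stop()
  }
}
{code}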

> UDF doesn't work on non-local spark
> ---
>
> Key: SPARK-18075
> URL: https://issues.apache.org/jira/browse/SPARK-18075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Nick Orka
>
> I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz)
> According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 
> I've made all spark dependancies with PROVIDED scope. I use 100% same 
> versions of spark in the app as well as for spark server. 
> Here is my pom:
> {code:title=pom.xml}
> 
> 1.6
> 1.6
> UTF-8
> 2.11.8
> 2.0.0
> 2.7.0
> 
> 
> <dependencies>
> <dependency>
> <groupId>org.apache.spark</groupId>
> <artifactId>spark-core_2.11</artifactId>
> <version>${spark.version}</version>
> <scope>provided</scope>
> </dependency>
> <dependency>
> <groupId>org.apache.spark</groupId>
> <artifactId>spark-sql_2.11</artifactId>
> <version>${spark.version}</version>
> <scope>provided</scope>
> </dependency>
> <dependency>
> <groupId>org.apache.spark</groupId>
> <artifactId>spark-hive_2.11</artifactId>
> <version>${spark.version}</version>
> <scope>provided</scope>
> </dependency>
> </dependencies>
> {code}
> As you can see, all spark dependencies have the provided scope.
> And this is the code for reproduction:
> {code:title=udfTest.scala}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.apache.spark.sql.{Row, SparkSession}
> /**
>   * Created by nborunov on 10/19/16.
>   */
> object udfTest {
>   class Seq extends Serializable {
> var i = 0
> def getVal: Int = {
>   i = i + 1
>   i
> }
>   }
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder()
> .master("spark://nborunov-mbp.local:7077")
> //  .master("local")
>   .getOrCreate()
> val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
> val schema = StructType(Array(StructField("name", StringType)))
> val df = spark.createDataFrame(rdd, schema)
> df.show()
> spark.udf.register("func", (name: String) => name.toUpperCase)
> import org.apache.spark.sql.functions.expr
> val newDf = df.withColumn("upperName", expr("func(name)"))
> newDf.show()
> val seq = new Seq
> spark.udf.register("seq", () => seq.getVal)
> val seqDf = df.withColumn("id", expr("seq()"))
> seqDf.show()
> df.createOrReplaceTempView("df")
> spark.sql("select *, seq() as sql_id from df").show()
>   }
> }
> {code}
> When .master("local") - everything works fine. When 
> .master("spark://...:7077"), it fails on line:
> {code}
> newDf.show()
> {code}
> The error is exactly the same:
> {code}
> scala> udfTest.main(Array())
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
> 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(nborunov); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(nborunov); groups with modify permissions: Set()
> 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
> port 57828.
> 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
> 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
> 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
> 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MB
> 16/10/19 19

[jira] [Resolved] (SPARK-20863) Add metrics/instrumentation to LiveListenerBus

2017-06-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20863.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add metrics/instrumentation to LiveListenerBus
> --
>
> Key: SPARK-20863
> URL: https://issues.apache.org/jira/browse/SPARK-20863
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.3.0
>
>
> I think that we should add Coda Hale metrics to the LiveListenerBus in order 
> to count the number of queued, processed, and dropped events, as well as a 
> timer tracking per-event processing times.
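
As a rough sketch only (not Spark's actual implementation, and the metric names below are made up), the Coda Hale API for this kind of instrumentation looks like:

{code:title=ListenerBusMetricsSketch.scala}
import com.codahale.metrics.MetricRegistry

// Counts queued/dropped events and times per-event processing.
class ListenerBusMetricsSketch(registry: MetricRegistry) {
  private val queued = registry.counter("listenerBus.queued")
  private val dropped = registry.counter("listenerBus.dropped")
  private val processingTime = registry.timer("listenerBus.eventProcessingTime")

  def onPost(): Unit = queued.inc()

  def onDrop(): Unit = dropped.inc()

  def timeEvent[T](handle: => T): T = {
    val ctx = processingTime.time() // returns a Timer.Context
    try handle finally ctx.stop()
  }
}
{code}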



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1

2017-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13933:


Assignee: (was: Apache Spark)

> hadoop-2.7 profile's curator version should be 2.7.1
> 
>
> Key: SPARK-13933
> URL: https://issues.apache.org/jira/browse/SPARK-13933
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> This is pretty minor, more due diligence than any binary compatibility.
> # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0
> # the actual hadoop-2.7.1 dependency is of curator 2.7.1; this came from 
> HADOOP-11492
> For consistency, the profile can/should be changed. However, note that as 
> well as some incompatibilities defined in HADOOP-11492; the version  of Guava 
> that curator asserts a need for is 15.x. HADOOP-11612 showed what needed to 
> be done to address compatibility problems there; one of the Curator classes 
> had to be forked to make compatible with guava 11+



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1

2017-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043748#comment-16043748
 ] 

Apache Spark commented on SPARK-13933:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/18247

> hadoop-2.7 profile's curator version should be 2.7.1
> 
>
> Key: SPARK-13933
> URL: https://issues.apache.org/jira/browse/SPARK-13933
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Priority: Minor
>
> This is pretty minor, more due diligence than any binary compatibility.
> # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0
> # the actual hadoop-2.7.1 dependency is of curator 2.7.1; this came from 
> HADOOP-11492
> For consistency, the profile can/should be changed. However, note that as 
> well as some incompatibilities defined in HADOOP-11492; the version  of Guava 
> that curator asserts a need for is 15.x. HADOOP-11612 showed what needed to 
> be done to address compatibility problems there; one of the Curator classes 
> had to be forked to make compatible with guava 11+



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13933) hadoop-2.7 profile's curator version should be 2.7.1

2017-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13933:


Assignee: Apache Spark

> hadoop-2.7 profile's curator version should be 2.7.1
> 
>
> Key: SPARK-13933
> URL: https://issues.apache.org/jira/browse/SPARK-13933
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Minor
>
> This is pretty minor, more due diligence than any binary compatibility.
> # the {{hadoop-2.7}} profile declares the curator version to be 2.6.0
> # the actual hadoop-2.7.1 dependency is of curator 2.7.1; this came from 
> HADOOP-11492
> For consistency, the profile can/should be changed. However, note that as 
> well as some incompatibilities defined in HADOOP-11492; the version  of Guava 
> that curator asserts a need for is 15.x. HADOOP-11612 showed what needed to 
> be done to address compatibility problems there; one of the Curator classes 
> had to be forked to make compatible with guava 11+



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21028) Parallel One vs. Rest Classifier Scala

2017-06-08 Thread Ajay Saini (JIRA)
Ajay Saini created SPARK-21028:
--

 Summary: Parallel One vs. Rest Classifier Scala
 Key: SPARK-21028
 URL: https://issues.apache.org/jira/browse/SPARK-21028
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0, 2.2.1
Reporter: Ajay Saini


Adding a class for a parallel one vs. rest implementation to the ml package in 
Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-08 Thread Ajay Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Saini updated SPARK-21027:
---
Description: Adding a class called ParOneVsRest that includes support for a 
parallelism parameter in a one vs. rest implementation. A parallel one vs. rest 
implementation gives up to a 2X speedup when tested on a dataset with 181024 
points. A ticket for the Scala implementation of this classifier is here: 
https://issues.apache.org/jira/browse/SPARK-21028  (was: Adding a class called 
ParOneVsRest that includes support for a parallelism parameter in a one vs. 
rest implementation. A parallel one vs. rest implementation gives up to a 2X 
speedup when tested on a dataset with 181024 points.)
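
As a minimal sketch of the idea only (not the actual ParOneVsRest code; {{trainBinary}} is a hypothetical stand-in for fitting one binary classifier):

{code:title=ParallelOneVsRestSketch.scala}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

object ParallelOneVsRestSketch {
  // Train the per-class binary models with a bounded level of parallelism.
  def fitOneVsRest[M](numClasses: Int, parallelism: Int)(trainBinary: Int => M): Seq[M] = {
    val classes = (0 until numClasses).par
    classes.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(parallelism))
    classes.map(trainBinary).seq
  }
}
{code}

Setting the parallelism to 1 recovers the sequential one-vs-rest behaviour.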

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Adding a class called ParOneVsRest that includes support for a parallelism 
> parameter in a one vs. rest implementation. A parallel one vs. rest 
> implementation gives up to a 2X speedup when tested on a dataset with 181024 
> points. A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-08 Thread Ajay Saini (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajay Saini updated SPARK-21027:
---
Description: Adding a class called ParOneVsRest that includes support for a 
parallelism parameter in a one vs. rest implementation. A parallel one vs. rest 
implementation gives up to a 2X speedup when tested on a dataset with 181024 
points.  (was: Adding a class called ParOneVsRest that includes support for a 
parallelism parameter in a one vs. rest implementation.)

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>
> Adding a class called ParOneVsRest that includes support for a parallelism 
> parameter in a one vs. rest implementation. A parallel one vs. rest 
> implementation gives up to a 2X speedup when tested on a dataset with 181024 
> points.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21027) Parallel One vs. Rest Classifier

2017-06-08 Thread Ajay Saini (JIRA)
Ajay Saini created SPARK-21027:
--

 Summary: Parallel One vs. Rest Classifier
 Key: SPARK-21027
 URL: https://issues.apache.org/jira/browse/SPARK-21027
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.2.0, 2.2.1
Reporter: Ajay Saini


Adding a class called ParOneVsRest that includes support for a parallelism 
parameter in a one vs. rest implementation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043732#comment-16043732
 ] 

Shixiong Zhu commented on SPARK-20952:
--

For `ParquetFileFormat#readFootersInParallel`, I would suggest that you just 
set the TaskContext in "parFiles.flatMap". 

{code}
val taskContext = TaskContext.get

val parFiles = partFiles.par
parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
parFiles.flatMap { currentFile =>
  TaskContext.setTaskContext(taskContext)
  ...
}.seq
{code}

In this special case, it's safe since this is a local one-time thread pool.
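
Stripped of the Parquet specifics, that capture-and-restore pattern looks roughly like this self-contained sketch (not Spark code; the context value is a placeholder):

{code:title=CaptureAndRestoreSketch.scala}
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

object CaptureAndRestoreSketch {
  // Stand-in for TaskContext: a plain ThreadLocal that pooled threads don't inherit.
  val context = new ThreadLocal[String]

  def main(args: Array[String]): Unit = {
    context.set("task-attempt-42") // set on the "task" thread

    val captured = context.get // capture into a local variable before forking

    val items = (1 to 4).par
    items.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
    items.foreach { i =>
      context.set(captured) // restore inside the pooled thread
      println(s"item $i sees context ${context.get}")
    }
  }
}
{code}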

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714
 ] 

Robert Kruszewski edited comment on SPARK-20952 at 6/9/17 12:28 AM:


~Right, but how do I pass it downstream?~ So I would store it and restore it 
inside the threadpool? Now every Spark contributor has to know about it, but if 
that's preferred I'm happy to modify.


was (Author: robert3005):
Right, but how do I pass it downstream?

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043714#comment-16043714
 ] 

Robert Kruszewski commented on SPARK-20952:
---

Right, but how do I pass it downstream?

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043712#comment-16043712
 ] 

Robert Kruszewski commented on SPARK-20952:
---

It doesn't, but things underneath it do. It's weird from a consumer perspective 
to have a feature that you can't really use because you can't assert that it 
behaves consistently. In my case we have some filesystem features relying on 
TaskContext.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043710#comment-16043710
 ] 

Shixiong Zhu commented on SPARK-20952:
--

Although I don't know what you plan to do, you can save the TaskContext into a 
local variable like this:
{code}
  private[parquet] def readParquetFootersInParallel(
      conf: Configuration,
      partFiles: Seq[FileStatus],
      ignoreCorruptFiles: Boolean): Seq[Footer] = {
    val taskContext = TaskContext.get

    val parFiles = partFiles.par
    parFiles.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
    parFiles.flatMap { currentFile =>
      try {
        // Use `taskContext` rather than `TaskContext.get`

        // Skips row group information since we only need the schema.
        // ParquetFileReader.readFooter throws RuntimeException, instead of IOException,
        // when it can't read the footer.
        Some(new Footer(currentFile.getPath(),
          ParquetFileReader.readFooter(
            conf, currentFile, SKIP_ROW_GROUPS)))
      } catch { case e: RuntimeException =>
        if (ignoreCorruptFiles) {
          logWarning(s"Skipped the footer in the corrupted file: $currentFile", e)
          None
        } else {
          throw new IOException(s"Could not read footer for file: $currentFile", e)
        }
      }
    }.seq
  }
{code}

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043708#comment-16043708
 ] 

Shixiong Zhu commented on SPARK-20952:
--

Why does it need TaskContext?

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043704#comment-16043704
 ] 

Robert Kruszewski commented on SPARK-20952:
---

No modifications, it's this code 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L477
 which spins up a threadpool to read files per partition. I imagine there are 
more cases like this, but this is the first one I encountered.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20954) DESCRIBE showing 1 extra row of "| # col_name | data_type | comment |"

2017-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043700#comment-16043700
 ] 

Apache Spark commented on SPARK-20954:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18245

> DESCRIBE showing 1 extra row of "| # col_name  | data_type  | comment  |"
> -
>
> Key: SPARK-20954
> URL: https://issues.apache.org/jira/browse/SPARK-20954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Garros Chan
>
> I am trying to do DESCRIBE on a table but I am seeing 1 extra row being auto-added 
> to the result. You can see there is this 1 extra row with "| # col_name  | 
> data_type  | comment  |"; however, select and select count(*) only show 1 
> row.
> I searched online for a long time and did not find any useful information.
> Is this a bug?
> hdp106m2:/usr/hdp/2.5.0.2-3/spark2 # ./bin/beeline
> Beeline version 1.2.1.spark2 by Apache Hive
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: backward-delete-word
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> [INFO] Unable to bind key for unsupported operation: up-history
> [INFO] Unable to bind key for unsupported operation: down-history
> beeline> !connect jdbc:hive2://localhost:10016
> Connecting to jdbc:hive2://localhost:10016
> Enter username for jdbc:hive2://localhost:10016: hive
> Enter password for jdbc:hive2://localhost:10016: 
> 17/06/01 14:13:04 INFO Utils: Supplied authorities: localhost:10016
> 17/06/01 14:13:04 INFO Utils: Resolved authority: localhost:10016
> 17/06/01 14:13:04 INFO HiveConnection: Will try to open client transport with 
> JDBC Uri: jdbc:hive2://localhost:10016
> Connected to: Spark SQL (version 2.2.1-SNAPSHOT)
> Driver: Hive JDBC (version 1.2.1.spark2)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> 0: jdbc:hive2://localhost:10016> describe garros.hivefloat;
> +-++--+--+
> |  col_name   | data_type  | comment  |
> +-++--+--+
> | # col_name  | data_type  | comment  |
> | c1  | float  | NULL |
> +-++--+--+
> 2 rows selected (0.396 seconds)
> 0: jdbc:hive2://localhost:10016> select * from garros.hivefloat;
> +-+--+
> | c1  |
> +-+--+
> | 123.99800109863281  |
> +-+--+
> 1 row selected (0.319 seconds)
> 0: jdbc:hive2://localhost:10016> select count(*) from garros.hivefloat;
> +---+--+
> | count(1)  |
> +---+--+
> | 1 |
> +---+--+
> 1 row selected (0.783 seconds)
> 0: jdbc:hive2://localhost:10016> describe formatted garros.hiveint;
> +---+-+--+--+
> |   col_name|  data_type  | comment  |
> +---+-+--+--+
> | # col_name| data_type   | comment  |
> | c1| int | NULL |
> |   | |  |
> | # Detailed Table Information  | |  |
> | Database  | garros  |  |
> | Table | hiveint |  |
> | Owner | root|  |
> | Created   | Thu Feb 09 17:40:36 EST 2017|  |
> | Last Access   | Wed Dec 31 19:00:00 EST 1969|  |
> | Type  | MANAGED |  |
> | Provider  | hiv

[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043694#comment-16043694
 ] 

Shixiong Zhu commented on SPARK-20952:
--

[~robert3005] could you show me your code? Are you modifying 
"ParquetFileFormat#readFootersInParallel"?

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043669#comment-16043669
 ] 

Robert Kruszewski commented on SPARK-20952:
---

I am not really attached to the solution. I would be happy to implement anything 
that the maintainers are happy with, as long as it ensures we can always get the 
TaskContext anywhere on the task side. For instance, the issue I am facing now is 
that ParquetFileFormat#readFootersInParallel is not able to access it, leading to 
failures.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043655#comment-16043655
 ] 

Shixiong Zhu commented on SPARK-20952:
--

If TaskContext is not inheritable, we can always find a way to pass it to the 
code that needs to access it. But if it's inheritable, it's pretty hard to 
avoid TaskContext pollution (or to avoid using a stale TaskContext: you would 
have to always set it manually in any task running on a cached thread).

[~joshrosen] listed many tickets that were caused by localProperties being an 
InheritableThreadLocal: 
https://issues.apache.org/jira/browse/SPARK-14686?focusedCommentId=15244478&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15244478
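
A self-contained sketch of that staleness hazard with cached threads (not Spark code; the values are placeholders):

{code:title=StaleInheritableThreadLocal.scala}
import java.util.concurrent.Executors

object StaleInheritableThreadLocal {
  val ctx = new InheritableThreadLocal[String]

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)

    ctx.set("task-1")
    // The pool's worker thread is created on the first submit and inherits "task-1".
    pool.submit(new Runnable { def run(): Unit = println(s"first run sees ${ctx.get}") }).get()

    ctx.set("task-2")
    // The cached worker thread is reused, so it still sees the stale "task-1".
    pool.submit(new Runnable { def run(): Unit = println(s"second run sees ${ctx.get}") }).get()

    pool.shutdown()
  }
}
{code}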

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20211) `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) exception

2017-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043639#comment-16043639
 ] 

Apache Spark commented on SPARK-20211:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18244

> `1 > 0.0001` throws Decimal scale (0) cannot be greater than precision (-2) 
> exception
> -
>
> Key: SPARK-20211
> URL: https://issues.apache.org/jira/browse/SPARK-20211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>Reporter: StanZhai
>  Labels: correctness
>
> The following SQL:
> {code}
> select 1 > 0.0001 from tb
> {code}
> throws Decimal scale (0) cannot be greater than precision (-2) exception in 
> Spark 2.x.
> `floor(0.0001)` and `ceil(0.0001)` have the same problem in Spark 1.6.x and 
> Spark 2.x.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

2017-06-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043622#comment-16043622
 ] 

Josh Rosen commented on SPARK-20178:


Update: I commented over on 
https://github.com/apache/spark/pull/18150#discussion_r121018254. I now think 
that [~sitalke...@gmail.com]'s original approach is a good move for now. If 
there's controversy then I propose to add an experimental feature-flag to let 
users fall back to older behavior.

> Improve Scheduler fetch failures
> 
>
> Key: SPARK-20178
> URL: https://issues.apache.org/jira/browse/SPARK-20178
> Project: Spark
>  Issue Type: Epic
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of 
> fetch failures.  There are 4 jira currently related to this.  
> We should try to get a list of things we want to improve and come up with one 
> cohesive design.
> SPARK-20163,  SPARK-20091,  SPARK-14649 , and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20952) TaskContext should be an InheritableThreadLocal

2017-06-08 Thread Robert Kruszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043623#comment-16043623
 ] 

Robert Kruszewski commented on SPARK-20952:
---

This is already an issue on the driver side, though (that threadpool is 
driver-side, which already has the inheritable thread-local behaviour). This 
issue is only about getting the same behaviour on executors as on the driver.

> TaskContext should be an InheritableThreadLocal
> ---
>
> Key: SPARK-20952
> URL: https://issues.apache.org/jira/browse/SPARK-20952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Robert Kruszewski
>Priority: Minor
>
> TaskContext is a ThreadLocal as a result when you fork a thread inside your 
> executor task you lose the handle on the original context set by the 
> executor. We should change it to InheritableThreadLocal so we can access it 
> inside thread pools on executors. 
> See ParquetFileFormat#readFootersInParallel for example of code that uses 
> thread pools inside the tasks.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20953) Add hash map metrics to aggregate and join

2017-06-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043581#comment-16043581
 ] 

Reynold Xin commented on SPARK-20953:
-

I'd show the avg in the UI if possible. As a matter of fact maybe only show the 
avg.


> Add hash map metrics to aggregate and join
> --
>
> Key: SPARK-20953
> URL: https://issues.apache.org/jira/browse/SPARK-20953
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> It would be useful if we can identify hash map collision issues early on.
> We should add avg hash map probe metric to aggregate operator and hash join 
> operator and report them. If the avg probe is greater than a specific 
> (configurable) threshold, we should log an error at runtime.
> The primary classes to look at are UnsafeFixedWidthAggregationMap, 
> HashAggregateExec, HashedRelation, HashJoin.
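
As an illustration only (not how the Spark operators are actually wired), tracking an average probe count with a configurable logging threshold could look like:

{code:title=ProbeStatsSketch.scala}
class ProbeStatsSketch(logThreshold: Double) {
  private var totalProbes = 0L
  private var totalLookups = 0L

  // Record the number of probes a single hash map lookup needed.
  def record(probes: Int): Unit = {
    totalProbes += probes
    totalLookups += 1
  }

  def avgProbes: Double =
    if (totalLookups == 0) 0.0 else totalProbes.toDouble / totalLookups

  // Called once the operator finishes processing its input.
  def reportIfSlow(): Unit =
    if (avgProbes > logThreshold) {
      System.err.println(f"avg hash probes $avgProbes%.2f exceeded threshold $logThreshold%.2f")
    }
}
{code}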



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2017-06-08 Thread Nico Pappagianis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043550#comment-16043550
 ] 

Nico Pappagianis commented on SPARK-10795:
--

[~HackerWilson] Were you able to resolve this? I'm hitting the same thing 
running Spark 2.0.1 and Hadoop 2.7.2.

My Python code is just creating a SparkContext and then calling sc.stop().

In the YARN logs I see:

INFO: 2017-06-08 22:16:24,462 INFO  [main] yarn.Client - Uploading resource 
file:/home/.../python/lib/py4j-0.10.1-src.zip -> 
hdfs://.../.sparkStaging/application_1494012577752_1403/py4j-0.10.1-src.zip

when I do an fs -ls on the above HDFS directory it shows the py4j file, but the 
job fails with a FileNotFoundException for the py4j file above:

File does not exist: 
hdfs://.../.sparkStaging/application_1494012577752_1403/py4j-0.10.1-src.zip
(stack trace here: 
https://gist.github.com/anonymous/5506654b88e19e6f51ffbd85cd3f25ee)

One thing to note is that I am launching a Map-only job that launches the 
Spark application on the cluster. The launcher job is using SparkLauncher 
(Java). Master and deploy mode are set to "yarn" and "cluster", respectively.

When I submit the Python job via spark-submit, it runs successfully (I set 
HADOOP_CONF_DIR and HADOOP_JAVA_HOME to the same values as I am setting 
in the launcher job).
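
For context, the launcher side described above is roughly this shape (a sketch only; the paths and environment values are placeholders, not the real job's configuration):

{code:title=LauncherSketch.scala}
import scala.collection.JavaConverters._

import org.apache.spark.launcher.SparkLauncher

object LauncherSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder Hadoop configuration directory for the launcher's environment
    val env = Map("HADOOP_CONF_DIR" -> "/etc/hadoop/conf")

    val process = new SparkLauncher(env.asJava)
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setAppResource("/path/to/app.py") // placeholder PySpark script
      .launch()

    process.waitFor()
  }
}
{code}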






> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043478#comment-16043478
 ] 

Cheng Lian commented on SPARK-20958:


[~marmbrus], here is the draft release note entry:
{quote}
SPARK-20958: For users who use parquet-avro together with Spark 2.2, please use 
parquet-avro 1.8.1 instead of parquet-avro 1.8.2. This is because parquet-avro 
1.8.2 upgrades avro from 1.7.6 to 1.8.1, which is backward incompatible with 
1.7.6.
{quote}
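
For projects that consume parquet-avro directly, a minimal sketch of pinning the version in an sbt build (the sbt syntax is an assumption about the reader's build tool, not something from this ticket):

{code:title=build.sbt}
// Keep parquet-avro on 1.8.1 so it stays on avro 1.7.x alongside Spark 2.2
libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"
{code}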

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21026) Document jenkins plug-ins assumed by the spark documentation build

2017-06-08 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-21026:
--

 Summary: Document jenkins plug-ins assumed by the spark 
documentation build
 Key: SPARK-21026
 URL: https://issues.apache.org/jira/browse/SPARK-21026
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.1.1
Reporter: Erik Erlandson


I haven't been able to find documentation on what plug-ins the spark doc build 
assumes for jenkins.  Is there a list somewhere, or a gemfile?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21018) "Completed Jobs" and "Completed Stages" support pagination

2017-06-08 Thread Alex Bozarth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Bozarth resolved SPARK-21018.
--
Resolution: Duplicate

This was added in Spark 2.1 by SPARK-15590

> "Completed Jobs" and "Completed Stages" support pagination
> --
>
> Key: SPARK-21018
> URL: https://issues.apache.org/jira/browse/SPARK-21018
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.2
>Reporter: Jinhua Fu
>Priority: Minor
> Attachments: CompletedJobs.png, PagedTasks.png
>
>
> When using Thriftserver, the number of jobs and stages may be very large, and 
> if not paginated, the page will be very long and slow to load, especially 
> when spark.ui.retainedJobs is set to a large value. So I suggest that "Completed 
> Jobs" and "Completed Stages" support pagination.
> I'd like to change them to a paged display similar to the tasks in the 
> "Details for Stage" page.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-06-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-20958:
---
Labels: release-notes release_notes releasenotes  (was: release-notes)

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meng xi updated SPARK-21025:

Comment: was deleted

(was: I attached the Java file)

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
> Attachments: SparkTest.java
>
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator it = src.toLocalIterator();
> List> rddList = new LinkedList<>();
> List resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
> resultBuffer.add(it.next());
> if (resultBuffer.size() == 1000) {
> JavaRDD rdd = jsc.parallelize(resultBuffer);
> //rdd.count();
> rddList.add(rdd);
> resultBuffer.clear();
> }
> }
> JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meng xi reopened SPARK-21025:
-

I attached the Java file

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
> Attachments: SparkTest.java
>
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator it = src.toLocalIterator();
> List> rddList = new LinkedList<>();
> List resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
> resultBuffer.add(it.next());
> if (resultBuffer.size() == 1000) {
> JavaRDD rdd = jsc.parallelize(resultBuffer);
> //rdd.count();
> rddList.add(rdd);
> resultBuffer.clear();
> }
> }
> JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043390#comment-16043390
 ] 

meng xi commented on SPARK-21025:
-

I attached the Java file, which does not have the formatting issue

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
> Attachments: SparkTest.java
>
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator it = src.toLocalIterator();
> List> rddList = new LinkedList<>();
> List resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
> resultBuffer.add(it.next());
> if (resultBuffer.size() == 1000) {
> JavaRDD rdd = jsc.parallelize(resultBuffer);
> //rdd.count();
> rddList.add(rdd);
> resultBuffer.clear();
> }
> }
> JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meng xi updated SPARK-21025:

Attachment: SparkTest.java

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
> Attachments: SparkTest.java
>
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new 
> SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = 
> JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD src = jsc.parallelize(IntStream.range(0, 
> 3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator it = src.toLocalIterator();
> List> rddList = new LinkedList<>();
> List resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
> resultBuffer.add(it.next());
> if (resultBuffer.size() == 1000) {
> JavaRDD rdd = jsc.parallelize(resultBuffer);
> //rdd.count();
> rddList.add(rdd);
> resultBuffer.clear();
> }
> }
> JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), 
> rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meng xi updated SPARK-21025:

Description: 
We are using an iterator over an RDD for some special data processing, and then 
using union to rebuild a new RDD. We found the resulting RDD is often empty or 
missing most of the data. Here is a simplified code snippet for this bug:

SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000)
        .mapToObj(i -> new String[10]).collect(Collectors.toList()));
Iterator<String[]> it = src.toLocalIterator();
List<JavaRDD<String[]>> rddList = new LinkedList<>();
List<String[]> resultBuffer = new LinkedList<>();
while (it.hasNext()) {
    resultBuffer.add(it.next());
    if (resultBuffer.size() == 1000) {
        JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
        //rdd.count();
        rddList.add(rdd);
        resultBuffer.clear();
    }
}
JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
System.out.println(desc.count());

This code should duplicate the original RDD, but it just returns an empty RDD. 
Please note that if I uncomment the rdd.count, it returns the correct 
result. 

  was:
we are using an iterator of RDD for some special data processing, and then 
using union to rebuild a new RDD. we found the result RDD are often empty or 
missing most of the data. Here is a simplified code snippet for this bug:

SparkConf sparkConf = new 
SparkConf().setAppName("Test").setMaster("local[*]");
SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
JavaRDD src = jsc.parallelize(IntStream.range(0, 
3000).mapToObj(i -> new String[10]).collect(Collectors.toList()));
Iterator it = src.toLocalIterator();
List> rddList = new LinkedList<>();
List resultBuffer = new LinkedList<>();
while (it.hasNext()) {
resultBuffer.add(it.next());
if (resultBuffer.size() == 1000) {
JavaRDD rdd = jsc.parallelize(resultBuffer);

//rdd.count();
rddList.add(rdd);
resultBuffer.clear();
}
}
JavaRDD desc = jsc.union(jsc.parallelize(resultBuffer), 
rddList);
System.out.println(desc.count());

this code should duplicate the original RDD, but it just returns an empty RDD. 
Please note that if I uncomment the rdd.count, it will return the correct 
result. 


> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000)
>     .mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator<String[]> it = src.toLocalIterator();
> List<JavaRDD<String[]>> rddList = new LinkedList<>();
> List<String[]> resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
>     resultBuffer.add(it.next());
>     if (resultBuffer.size() == 1000) {
>         JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
>         // rdd.count();
>         rddList.add(rdd);
>         resultBuffer.clear();
>     }
> }
> JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043379#comment-16043379
 ] 

meng xi commented on SPARK-21025:
-

No, I just commented out one line; the system incorrectly formatted my code that way.

OK, let me explain a little about our code logic: we would like to do a "carry forward" 
data cleansing pass, which uses the previous data point to fill in missing fields of the 
current one. After scanning the whole RDD, we reconstruct it. This snippet just clones 
the original RDD, but if you run it, the resulting RDD is empty.
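
As an aside, a carry-forward fill like the one described above can often be done without bringing the data back to the driver. A rough sketch, reusing the src RDD from the snippet (names are illustrative, and it only carries values within a partition):

{code}
JavaRDD<String[]> filled = src.mapPartitions(rows -> {
    List<String[]> out = new ArrayList<>();
    String[] prev = null;
    while (rows.hasNext()) {
        String[] row = rows.next().clone();
        if (prev != null) {
            for (int i = 0; i < row.length; i++) {
                if (row[i] == null) {
                    row[i] = prev[i]; // fill a missing field from the previous row
                }
            }
        }
        prev = row;
        out.add(row);
    }
    return out.iterator(); // Spark 2.x FlatMapFunction returns an Iterator
});
{code}

Rows at the start of a partition have no previous value here; carrying values across partition boundaries would need extra handling (for example, a separate pass over the boundary rows).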

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000)
>     .mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator<String[]> it = src.toLocalIterator();
> List<JavaRDD<String[]>> rddList = new LinkedList<>();
> List<String[]> resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
>     resultBuffer.add(it.next());
>     if (resultBuffer.size() == 1000) {
>         JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
>         // rdd.count();
>         rddList.add(rdd);
>         resultBuffer.clear();
>     }
> }
> JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Colin Woodbury (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Woodbury closed SPARK-21022.
--
Resolution: Invalid

Wasn't actually a bug - `foreach` _doesn't_ actually swallow exceptions.

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Colin Woodbury (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043368#comment-16043368
 ] 

Colin Woodbury commented on SPARK-21022:


Ah ok, that makes sense for `foreachPartition`. And wouldn't you know, I 
retried my tests with `foreach`, and they _do_ throw now. I swear they weren't 
this morning :S

Anyway, it looks like this isn't a bug after all. Thanks for the confirmation.

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21024) CSV parse mode handles Univocity parser exceptions

2017-06-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043310#comment-16043310
 ] 

Xiao Li commented on SPARK-21024:
-

Yes! We should fix it. Thanks!
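
Until a fix like the linked branch lands, one possible workaround, sketched in Java against the repro quoted below (maxColumns and mode are standard CSV reader options; worth verifying on your own data):

{code}
Dataset<Row> df = spark.read()
    .format("csv")
    .schema(new StructType()
        .add("a", DataTypes.IntegerType)
        .add("b", DataTypes.IntegerType))
    .option("maxColumns", "20480")    // keep Univocity's column limit above the widest line
    .option("mode", "DROPMALFORMED")  // drop rows that do not match the schema
    .load("/Users/maropu/Desktop/data");
{code}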

> CSV parse mode handles Univocity parser exceptions
> --
>
> Key: SPARK-21024
> URL: https://issues.apache.org/jira/browse/SPARK-21024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current master cannot skip illegal records that the Univocity parser itself rejects:
> This comes from the spark-user mailing list:
> https://www.mail-archive.com/user@spark.apache.org/msg63985.html
> {code}
> scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
> scala> val df = spark.read.format("csv").schema("a int, b 
> int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
> scala> df.show
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException - 3
> Hint: Number of columns processed may have exceeded limit of 3 columns. Use 
> settings.setMaxColumns(int) to define the maximum number of columns your 
> input can have
> Ensure your configuration is correct, with delimiters, quotes and escape 
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> ...
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
> at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
> at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> ...
> {code}
> We could easily fix this like: 
> https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21022:
-
Comment: was deleted

(was: ~~Good catch...~~)

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043266#comment-16043266
 ] 

Shixiong Zhu commented on SPARK-21022:
--

Wait. I also checked the `foreach` method. It does throw the exception. Probably you 
just missed the exception among the large amount of log output?

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21025.
---
Resolution: Invalid

Oh, and I realized the mistake here. You put the body of the loop on one line and 
attempted to comment out just one statement, but that comments out all of the other 
statements on the same line as well, including the one that updates resultBuffer.
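
To make the point concrete, a hypothetical reconstruction (not code from the report) of the one-line form versus the multi-line form:

{code}
// One-line body: the "//" disables rddList.add(rdd) and resultBuffer.clear() as well.
if (resultBuffer.size() == 1000) { JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer); //rdd.count(); rddList.add(rdd); resultBuffer.clear();
}

// One statement per line: only the count() call is disabled.
if (resultBuffer.size() == 1000) {
    JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
    // rdd.count();
    rddList.add(rdd);
    resultBuffer.clear();
}
{code}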

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000)
>     .mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator<String[]> it = src.toLocalIterator();
> List<JavaRDD<String[]>> rddList = new LinkedList<>();
> List<String[]> resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
>     resultBuffer.add(it.next());
>     if (resultBuffer.size() == 1000) {
>         JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
>         // rdd.count();
>         rddList.add(rdd);
>         resultBuffer.clear();
>     }
> }
> JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20976) Unify Error Messages for FAILFAST mode.

2017-06-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-20976.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18196
[https://github.com/apache/spark/pull/18196]
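
For reference, a minimal Java sketch (not from this ticket) of selecting FAILFAST mode; with it, a malformed record fails the query with a message like the one quoted below:

{code}
Dataset<Row> df = spark.read()
    .option("mode", "FAILFAST")   // alternatives: PERMISSIVE (default), DROPMALFORMED
    .json("/path/to/input.json"); // hypothetical path
{code}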

> Unify Error Messages for FAILFAST mode. 
> 
>
> Key: SPARK-20976
> URL: https://issues.apache.org/jira/browse/SPARK-20976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Previously, we indicate the job was terminated because of `FAILFAST` mode. 
> {noformat}
> Malformed line in FAILFAST mode: {"a":{, b:3}
> {noformat}
> If possible, we should keep it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043223#comment-16043223
 ] 

Shixiong Zhu edited comment on SPARK-21022 at 6/8/17 7:05 PM:
--

~~Good catch...~~


was (Author: zsxwing):
Good catch...

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043247#comment-16043247
 ] 

Shixiong Zhu commented on SPARK-21022:
--

By the way, `foreachPartition` doesn't have the issue either. It's just that 
`Iterator.map` is lazy and the Iterator is never consumed, so the closure body never actually runs.

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043229#comment-16043229
 ] 

Sean Owen commented on SPARK-21025:
---

It's not clear why you're parallelizing 'src' to begin with, or that this is really a 
minimal reproduction. What are the values and sizes of all the intermediate 
structures? Something else is going wrong here.

> missing data in jsc.union
> -
>
> Key: SPARK-21025
> URL: https://issues.apache.org/jira/browse/SPARK-21025
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0, 2.1.1
> Environment: Ubuntu 16.04
>Reporter: meng xi
>
> we are using an iterator of RDD for some special data processing, and then 
> using union to rebuild a new RDD. we found the result RDD are often empty or 
> missing most of the data. Here is a simplified code snippet for this bug:
> SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
> SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
> JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
> JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000)
>     .mapToObj(i -> new String[10]).collect(Collectors.toList()));
> Iterator<String[]> it = src.toLocalIterator();
> List<JavaRDD<String[]>> rddList = new LinkedList<>();
> List<String[]> resultBuffer = new LinkedList<>();
> while (it.hasNext()) {
>     resultBuffer.add(it.next());
>     if (resultBuffer.size() == 1000) {
>         JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
>         // rdd.count();
>         rddList.add(rdd);
>         resultBuffer.clear();
>     }
> }
> JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
> System.out.println(desc.count());
> this code should duplicate the original RDD, but it just returns an empty 
> RDD. Please note that if I uncomment the rdd.count, it will return the 
> correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-21022:


Assignee: Shixiong Zhu

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Assignee: Shixiong Zhu
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043223#comment-16043223
 ] 

Shixiong Zhu commented on SPARK-21022:
--

Good catch...

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043209#comment-16043209
 ] 

Marcelo Vanzin commented on SPARK-21023:


Then your best bet is a new command line option that implements the behavior 
you want.
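
Whatever form the switch takes, the merge itself would be simple. A rough sketch (not an actual patch) based on the loadPropertiesFile method quoted below: load spark-defaults.conf first, then overlay the user-supplied file so application values win but cluster defaults are not silently dropped.

{code}
private Properties loadMergedPropertiesFile() throws IOException {
  Properties props = new Properties();
  File defaultsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
  if (defaultsFile.isFile()) {
    try (InputStream in = new FileInputStream(defaultsFile)) {
      props.load(in);                 // cluster-wide defaults first
    }
  }
  if (propertiesFile != null) {
    File userFile = new File(propertiesFile);
    checkArgument(userFile.isFile(), "Invalid properties file '%s'.", propertiesFile);
    try (InputStream in = new FileInputStream(userFile)) {
      props.load(in);                 // user-supplied values override defaults
    }
  }
  return props;
}
{code}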

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043191#comment-16043191
 ] 

Lantao Jin commented on SPARK-21023:


I think {{--conf}} couldn't help here, because from the infra team's point of view 
they want their cluster-level configuration to take effect in all jobs unless a 
customer explicitly overrides it. Would it make sense to add a switch value in 
spark-env.sh?

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043179#comment-16043179
 ] 

Lantao Jin commented on SPARK-21023:


{quote}
it may break existing applications
{quote}
I do understand the risk and want to do the right thing. We need to find a way to keep 
the current behavior while making it easy to switch to the behavior we want.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043176#comment-16043176
 ] 

Marcelo Vanzin commented on SPARK-21023:


bq.  When and where the new config option be set? 

That's what makes that option awkward. It would have to be set in the user 
config or on the command line with {{\-\-conf}}. So it's not that much 
different from a new command line option, other than that it avoids adding one.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043172#comment-16043172
 ] 

Lantao Jin commented on SPARK-21023:


{quote}
Another option is to have a config option
{quote}
Oh, sorry. {{--properties-file}} skips loading the default configuration 
file. When and where would the new config option be set? In spark-env?

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043168#comment-16043168
 ] 

Marcelo Vanzin commented on SPARK-21023:


bq. The purpose is making the default configuration loaded anytime.

We all understand the purpose. But it breaks the existing behavior, so it may 
break existing applications. That makes your solution, as presented, a 
non-starter.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21025) missing data in jsc.union

2017-06-08 Thread meng xi (JIRA)
meng xi created SPARK-21025:
---

 Summary: missing data in jsc.union
 Key: SPARK-21025
 URL: https://issues.apache.org/jira/browse/SPARK-21025
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.1, 2.1.0
 Environment: Ubuntu 16.04
Reporter: meng xi


We are using a local iterator over an RDD for some special data processing, and then 
using union to rebuild a new RDD. We found the resulting RDD is often empty or 
missing most of the data. Here is a simplified code snippet for this bug:

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]");
SparkContext sparkContext = SparkContext.getOrCreate(sparkConf);
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkContext);
// 3000 dummy rows, pulled back to the driver through a local iterator
JavaRDD<String[]> src = jsc.parallelize(IntStream.range(0, 3000)
    .mapToObj(i -> new String[10]).collect(Collectors.toList()));
Iterator<String[]> it = src.toLocalIterator();
List<JavaRDD<String[]>> rddList = new LinkedList<>();
List<String[]> resultBuffer = new LinkedList<>();
while (it.hasNext()) {
  resultBuffer.add(it.next());
  // re-parallelize every 1000 rows and reuse the buffer for the next chunk
  if (resultBuffer.size() == 1000) {
    JavaRDD<String[]> rdd = jsc.parallelize(resultBuffer);
    //rdd.count();   // uncommenting this makes the final count correct
    rddList.add(rdd);
    resultBuffer.clear();
  }
}
JavaRDD<String[]> desc = jsc.union(jsc.parallelize(resultBuffer), rddList);
System.out.println(desc.count());

This code should reproduce the original RDD, but it just returns an empty RDD. 
Please note that if I uncomment the rdd.count() call, it returns the correct 
result.
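
A minimal workaround sketch, under an assumption not confirmed in this thread: 
jsc.parallelize appears to keep a reference to the mutable resultBuffer rather 
than copying it, so clearing the buffer before any job has run empties the RDD 
(rdd.count() forces the partitions to materialize first, which would explain why 
uncommenting it changes the result). Copying the buffer before parallelizing 
avoids the shared mutable state:

{code}
// Inside the loop above: hand parallelize() a copy, so the new RDD does not
// share the LinkedList that is cleared immediately afterwards.
// (Assumes the shared-buffer explanation above; needs java.util.ArrayList.)
JavaRDD<String[]> rdd = jsc.parallelize(new ArrayList<>(resultBuffer));
rddList.add(rdd);
resultBuffer.clear();
{code}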



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043164#comment-16043164
 ] 

Lantao Jin commented on SPARK-21023:


{quote}
Another option is to have a config option
{quote}
LGTM

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043160#comment-16043160
 ] 

Lantao Jin commented on SPARK-21023:


*The purpose is to make the default configuration load every time.* The 
parameters an app developer sets are always fewer than they should be.
For example: an app developer sets spark.executor.instances=100 in their properties file. 
One month later the infra team upgrades Spark to a new version and enables dynamic 
resource allocation. But the old job cannot load the new parameters, so the dynamic 
feature is never enabled for it. That makes the cluster harder for the infra team to 
control and hurts performance for the app team.
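
For concreteness, these are the kinds of cluster-wide defaults meant here; the keys 
are real Spark settings, but the values are illustrative only. A job that skips 
spark-defaults.conf never picks them up:

{code}
# spark-defaults.conf maintained by the infra team (illustrative values)
spark.dynamicAllocation.enabled       true
spark.shuffle.service.enabled         true
spark.dynamicAllocation.maxExecutors  200
{code}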

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043160#comment-16043160
 ] 

Lantao Jin edited comment on SPARK-21023 at 6/8/17 6:06 PM:


The purpose is to make the default configuration load every time. The 
parameters an app developer sets are always fewer than they should be.
For example: an app developer sets spark.executor.instances=100 in their properties file. 
One month later the infra team upgrades Spark to a new version and enables dynamic 
resource allocation. But the old job cannot load the new parameters, so the dynamic 
feature is never enabled for it. That makes the cluster harder for the infra team to 
control and hurts performance for the app team.


was (Author: cltlfcjin):
*The purpose is making the default configuration loaded anytime.* Because the 
parameters app developer set always less the it should be.
For example: App dev set spark.executor.instances=100 in their properties file. 
But one month later the spark version upgrade to a new version by infra team 
and dynamic resource allocation enabled. But the old job can not load the new 
parameters so no dynamic feature enable for it. It still causes more challenge 
to control cluster for infra team and bad performance for app team.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20971) Purge the metadata log for FileStreamSource

2017-06-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043150#comment-16043150
 ] 

Shixiong Zhu commented on SPARK-20971:
--

FileStreamSource saves the seen files on disk/HDFS; we can use an approach similar to 
org.apache.spark.sql.execution.streaming.FileStreamSource.SeenFilesMap 
to purge the file entries.
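
A rough illustration of that idea in Java; this is not the actual Scala 
SeenFilesMap, just a sketch of age-based purging keyed on the latest timestamp seen:

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: track files by last-modified timestamp and drop
// entries older than (latest timestamp - maxAgeMs). It mirrors the idea behind
// FileStreamSource.SeenFilesMap, not its real implementation.
class SeenFiles {
  private final Map<String, Long> map = new HashMap<>();
  private final long maxAgeMs;
  private long latestTimestamp = 0L;

  SeenFiles(long maxAgeMs) { this.maxAgeMs = maxAgeMs; }

  void add(String path, long timestampMs) {
    map.put(path, timestampMs);
    latestTimestamp = Math.max(latestTimestamp, timestampMs);
  }

  boolean isNewFile(String path) { return !map.containsKey(path); }

  void purge() {
    long threshold = latestTimestamp - maxAgeMs;
    map.entrySet().removeIf(e -> e.getValue() < threshold);
  }
}
{code}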

> Purge the metadata log for FileStreamSource
> ---
>
> Key: SPARK-20971
> URL: https://issues.apache.org/jira/browse/SPARK-20971
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Shixiong Zhu
>
> Currently 
> [FileStreamSource.commit|https://github.com/apache/spark/blob/16186cdcbce1a2ec8f839c550e6b571bf5dc2692/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L258]
>  is empty. We can delete unused metadata logs in this method to reduce the 
> size of log files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043132#comment-16043132
 ] 

Marcelo Vanzin commented on SPARK-21023:


Another option is to have a config option that controls whether the default 
file is loaded on top of {{--properties-file}}. It avoids adding a new command 
line argument, but is a little more awkward to use.
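
As a sketch of what that could look like inside the loadPropertiesFile() method 
quoted in the issue description: the flag name spark.loadDefaults and the load() 
helper below are assumptions invented for illustration, while propertiesFile, 
getConfDir(), DEFAULT_PROPERTIES_FILE and checkArgument are the names already 
used in that snippet.

{code}
// Hypothetical: if the user file sets spark.loadDefaults=true, load
// spark-defaults.conf first and let the user-supplied file override it.
private Properties loadPropertiesFile() throws IOException {
  Properties props = new Properties();
  File defaults = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
  if (propertiesFile != null) {
    File userFile = new File(propertiesFile);
    checkArgument(userFile.isFile(), "Invalid properties file '%s'.", propertiesFile);
    Properties userProps = load(userFile);
    if (Boolean.parseBoolean(userProps.getProperty("spark.loadDefaults", "false"))
        && defaults.isFile()) {
      props.putAll(load(defaults));   // cluster-wide defaults first...
    }
    props.putAll(userProps);          // ...then user values override them
  } else if (defaults.isFile()) {
    props.putAll(load(defaults));
  }
  return props;
}

// Minimal helper assumed here, not part of the original snippet.
private Properties load(File f) throws IOException {
  Properties p = new Properties();
  try (java.io.FileReader r = new java.io.FileReader(f)) {
    p.load(r);
  }
  return p;
}
{code}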

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043131#comment-16043131
 ] 

Marcelo Vanzin commented on SPARK-21023:


bq. I suggest to change the current behavior

Yes, and we're saying that should not be done, because it's a change in 
semantics that might cause breakages in people's workflows. Regardless of 
whether the new behavior is better or worse, implementing it is a breaking 
change.

If you want this you need to implement it in a way that does not change the 
current behavior - e.g., as a new command line argument instead of modifying 
the behavior of the existing one.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043124#comment-16043124
 ] 

Lantao Jin commented on SPARK-21023:


[~vanzin] I suggest changing the current behavior and adding documentation to 
explain it: --properties-file would override the values set in 
spark-defaults.conf, equivalent to setting dozens of {{--conf k=v}} on the 
command line. Please review; I am open to any ideas.
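
To make the equivalence above concrete (property names and values are only 
examples): a file passed with --properties-file, such as

{code}
# my-app.conf, passed as: spark-submit --properties-file my-app.conf ...
spark.executor.instances  100
spark.executor.memory     4g
spark.serializer          org.apache.spark.serializer.KryoSerializer
{code}

behaves like passing each entry as a separate {{--conf k=v}} on the command 
line, except that today it also replaces spark-defaults.conf as the source of 
defaults.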

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043124#comment-16043124
 ] 

Lantao Jin edited comment on SPARK-21023 at 6/8/17 5:45 PM:


[~vanzin] I suggest changing the current behavior and adding documentation to 
explain it: \-\-properties-file would override the values set in 
spark-defaults.conf, equivalent to setting dozens of {{--conf k=v}} on the 
command line. Please review; I am open to any ideas.


was (Author: cltlfcjin):
[~vanzin] I suggest to change the current behavior and offer a document to 
illustrate this. --properties-file will overwrite the args which are set in  
spark-defaults.conf first. It's equivalent to set dozens of {{--conf k=v}} in 
command line. Please review and open for any ideas.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21023:


Assignee: Apache Spark

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043097#comment-16043097
 ] 

Apache Spark commented on SPARK-21023:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/18243

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21023:


Assignee: (was: Apache Spark)

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043053#comment-16043053
 ] 

Marcelo Vanzin edited comment on SPARK-21023 at 6/8/17 5:11 PM:


I thought we had an issue for adding a user-specific config file that is loaded 
on top of the defaults, but I can't find it. In any case, changing the current 
behavior is not really desired, but you can add this as a new feature without 
changing the current behavior.


was (Author: vanzin):
I thought we have an issue for adding a user-specific config file that is 
loaded on top of the defaults, but I can't find it. In any case, changing the 
current behavior is not really desired, but you can add this as a new feature 
without changing the current behavior.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043053#comment-16043053
 ] 

Marcelo Vanzin commented on SPARK-21023:


I thought we had an issue for adding a user-specific config file that is 
loaded on top of the defaults, but I can't find it. In any case, changing the 
current behavior is not really desired, but you can add this as a new feature 
without changing the current behavior.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043036#comment-16043036
 ] 

Sean Owen commented on SPARK-21023:
---

Maybe, but it would be a behavior change now. There are equal counter-arguments 
for the current behavior.

> Ignore to load default properties file is not a good choice from the 
> perspective of system
> --
>
> Key: SPARK-21023
> URL: https://issues.apache.org/jira/browse/SPARK-21023
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.1
>Reporter: Lantao Jin
>Priority: Minor
>
> The default properties file {{spark-defaults.conf}} shouldn't be ignore to 
> load even though the submit arg {{--properties-file}} is set. The reasons are 
> very easy to see:
> * Infrastructure team need continually update the {{spark-defaults.conf}} 
> when they want set something as default for entire cluster as a tuning 
> purpose.
> * Application developer only want to override the parameters they really want 
> rather than others they even doesn't know (Set by infrastructure team).
> * The purpose of using {{\-\-properties-file}} from most of application 
> developers is to avoid setting dozens of {{--conf k=v}}. But if 
> {{spark-defaults.conf}} is ignored, the behaviour becomes unexpected finally.
> All this caused by below codes:
> {code}
>   private Properties loadPropertiesFile() throws IOException {
> Properties props = new Properties();
> File propsFile;
> if (propertiesFile != null) {
> // default conf property file will not be loaded when app developer use 
> --properties-file as a submit args
>   propsFile = new File(propertiesFile);
>   checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
> propertiesFile);
> } else {
>   propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
> }
> //...
> return props;
>   }
> {code}
> I can offer a patch to fix it if you think it make sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing

2017-06-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043030#comment-16043030
 ] 

Marcelo Vanzin commented on SPARK-19185:


I merged Mark's patch above to master and branch-2.2, but it's just a 
work-around, not a fix, so I'll leave the bug open (and with no "fix version").

> ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
> -
>
> Key: SPARK-19185
> URL: https://issues.apache.org/jira/browse/SPARK-19185
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
> Environment: Spark 2.0.2
> Spark Streaming Kafka 010
> Mesos 0.28.0 - client mode
> spark.executor.cores 1
> spark.mesos.extra.cores 1
>Reporter: Kalvin Chau
>  Labels: streaming, windowing
>
> We've been running into ConcurrentModificationExceptions ("KafkaConsumer is 
> not safe for multi-threaded access") with the CachedKafkaConsumer. I've been 
> working through debugging this issue, and after looking through some of the 
> Spark source code I think this is a bug.
> Our set up is:
> Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using 
> Spark-Streaming-Kafka-010
> spark.executor.cores 1
> spark.mesos.extra.cores 1
> Batch interval: 10s, window interval: 180s, and slide interval: 30s
> We would see the exception when in one executor there are two task worker 
> threads assigned the same Topic+Partition, but a different set of offsets.
> They would both get the same CachedKafkaConsumer, and whichever task thread 
> went first would seek and poll for all the records, and at the same time the 
> second thread would try to seek to its offset but fail because it is unable 
> to acquire the lock.
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y
> Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z
> Time1 E0 Task0 - Seeks and starts to poll
> Time1 E0 Task1 - Attempts to seek, but fails
> Here are some relevant logs:
> {code}
> 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394204414 -> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing 
> topic test-topic, partition 2 offsets 4394238058 -> 4394257712
> 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394204414
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested 
> 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: 
> Initial fetch for spark-executor-consumer test-topic 2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: 
> Seeking to test-topic-2 4394238058
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting 
> block rdd_199_2 failed due to an exception
> 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block 
> rdd_199_2 could not be removed as it was not found on disk or in memory
> 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in 
> task 49.0 in stage 45.0 (TID 3201)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for 
> multi-threaded access
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
>   at 
> org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org

[jira] [Comment Edited] (SPARK-21001) Staging folders from Hive table are not being cleared.

2017-06-08 Thread Ajay Cherukuri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042981#comment-16042981
 ] 

Ajay Cherukuri edited comment on SPARK-21001 at 6/8/17 4:51 PM:


Hi Liang, do you mean 2.0.3?


was (Author: ajaycherukuri):
do you mean 2.0.3?

> Staging folders from Hive table are not being cleared.
> --
>
> Key: SPARK-21001
> URL: https://issues.apache.org/jira/browse/SPARK-21001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Ajay Cherukuri
>
> Staging folders that were being created as a part of Data loading to Hive 
> table by using spark job, are not cleared.
> Staging folder are remaining in Hive External table folders even after Spark 
> job is completed.
> This is the same issue mentioned in the 
> ticket:https://issues.apache.org/jira/browse/SPARK-18372
> This ticket says the issues was resolved in 1.6.4. But, now i found that it's 
> still existing on 2.0.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21013) Spark History Server does not show the logs of completed Yarn Jobs

2017-06-08 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21013.

Resolution: Duplicate

You need the MR history server for aggregated logs to show. This is already 
explained in Spark's documentation.
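
For reference, the redirect usually works once the NodeManagers know where the 
MapReduce JobHistory server serves aggregated logs. The host and port below are 
placeholders and the exact setup depends on the distribution; this is only an 
illustrative yarn-site.xml fragment, not a prescription:

{code}
<!-- yarn-site.xml on the NodeManagers (placeholder host/port) -->
<property>
  <name>yarn.log.server.url</name>
  <value>http://mr-history-host:19888/jobhistory/logs</value>
</property>
{code}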

> Spark History Server does not show the logs of completed Yarn Jobs
> --
>
> Key: SPARK-21013
> URL: https://issues.apache.org/jira/browse/SPARK-21013
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.1, 2.0.1, 2.1.0
>Reporter: Hari Ck
>Priority: Minor
>  Labels: historyserver, ui
>
> I am facing issue when accessing the container logs of a completed Spark 
> (Yarn) application from the History Server.
> Repro Steps:
> 1) Run the spark-shell in yarn client mode. Or run Pi job in Yarn mode. 
> 2) Once the job is completed, (in the case of spark shell, exit after doing 
> some simple operations), try to access the STDOUT or STDERR logs of the 
> application from the Executors tab in the Spark History Server UI. 
> 3) If yarn log aggregation is enabled, then logs won't be available in node 
> manager's log location.  But history Server is trying to access the logs from 
> the nodemanager's log location giving below error in the UI:
> Failed redirect for container_e31_1496881617682_0003_01_02
> ResourceManager
> RM Home
> NodeManager
> Tools
> Failed while trying to construct the redirect url to the log server. Log 
> Server url may not be configured
> java.lang.Exception: Unknown container. Container either has not started or 
> has already completed or doesn't belong to this node at all.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21001) Staging folders from Hive table are not being cleared.

2017-06-08 Thread Ajay Cherukuri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042981#comment-16042981
 ] 

Ajay Cherukuri commented on SPARK-21001:


do you mean 2.0.3?

> Staging folders from Hive table are not being cleared.
> --
>
> Key: SPARK-21001
> URL: https://issues.apache.org/jira/browse/SPARK-21001
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Ajay Cherukuri
>
> Staging folders that were being created as a part of Data loading to Hive 
> table by using spark job, are not cleared.
> Staging folder are remaining in Hive External table folders even after Spark 
> job is completed.
> This is the same issue mentioned in the 
> ticket:https://issues.apache.org/jira/browse/SPARK-18372
> This ticket says the issues was resolved in 1.6.4. But, now i found that it's 
> still existing on 2.0.2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21024) CSV parse mode handles Univocity parser exceptions

2017-06-08 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042925#comment-16042925
 ] 

Takeshi Yamamuro commented on SPARK-21024:
--

Is it worth fixing this? (I feel it is a kind of corner case.) cc: 
[~smilegator] [~hyukjin.kwon]

> CSV parse mode handles Univocity parser exceptions
> --
>
> Key: SPARK-21024
> URL: https://issues.apache.org/jira/browse/SPARK-21024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current master cannot skip the illegal records that Univocity parsers:
> This comes from the spark-user mailing list:
> https://www.mail-archive.com/user@spark.apache.org/msg63985.html
> {code}
> scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
> scala> val df = spark.read.format("csv").schema("a int, b 
> int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
> scala> df.show
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException - 3
> Hint: Number of columns processed may have exceeded limit of 3 columns. Use 
> settings.setMaxColumns(int) to define the maximum number of columns your 
> input can have
> Ensure your configuration is correct, with delimiters, quotes and escape 
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> ...
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
> at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
> at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> ...
> {code}
> We could easily fix this like: 
> https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21024) CSV parse mode handles Univocity parser exceptions

2017-06-08 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-21024:


 Summary: CSV parse mode handles Univocity parser exceptions
 Key: SPARK-21024
 URL: https://issues.apache.org/jira/browse/SPARK-21024
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro
Priority: Minor


The current master cannot skip the illegal records that the Univocity parser rejects.
This comes from the spark-user mailing list:
https://www.mail-archive.com/user@spark.apache.org/msg63985.html

{code}
scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
scala> val df = spark.read.format("csv").schema("a int, b 
int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
scala> df.show

com.univocity.parsers.common.TextParsingException: 
java.lang.ArrayIndexOutOfBoundsException - 3
Hint: Number of columns processed may have exceeded limit of 3 columns. Use 
settings.setMaxColumns(int) to define the maximum number of columns your input 
can have
Ensure your configuration is correct, with delimiters, quotes and escape 
sequences that match the input format you are trying to parse
Parser Configuration: CsvParserSettings:
Auto configuration enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Empty value=null
Escape unquoted values=false
...

at 
com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
at 
com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
at 
com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
at 
org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at 
org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
...
{code}

We could easily fix this along the lines of: 
https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser
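
For reference, a sketch (in Java) of the behavior the linked branch aims for: 
with a non-default parse mode such as DROPMALFORMED, the over-wide row would be 
dropped instead of the whole query failing. The input path is the one from the 
report, and the mode names are the existing CSV modes (PERMISSIVE, 
DROPMALFORMED, FAILFAST); whether they end up covering this exception is 
exactly what this ticket proposes.

{code}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CsvParseModeSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[*]").appName("csv-parse-mode").getOrCreate();
    StructType schema = new StructType()
        .add("a", DataTypes.IntegerType)
        .add("b", DataTypes.IntegerType);
    // Desired behavior once parse modes handle Univocity exceptions:
    // "0,1,2,3" exceeds maxColumns=3 and would simply be dropped.
    Dataset<Row> df = spark.read()
        .format("csv")
        .schema(schema)
        .option("maxColumns", "3")
        .option("mode", "DROPMALFORMED")
        .load("/Users/maropu/Desktop/data");
    df.show();
    spark.stop();
  }
}
{code}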



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21023) Ignore to load default properties file is not a good choice from the perspective of system

2017-06-08 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-21023:
--

 Summary: Ignore to load default properties file is not a good 
choice from the perspective of system
 Key: SPARK-21023
 URL: https://issues.apache.org/jira/browse/SPARK-21023
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 2.1.1
Reporter: Lantao Jin
Priority: Minor


The default properties file {{spark-defaults.conf}} should not be skipped even 
when the submit arg {{--properties-file}} is set. The reasons are easy to see:
* The infrastructure team needs to continually update {{spark-defaults.conf}} 
whenever they want to set cluster-wide defaults for tuning purposes.
* Application developers only want to override the parameters they care about, 
not others they may not even know about (set by the infrastructure team).
* For most application developers, the purpose of {{\-\-properties-file}} is to 
avoid setting dozens of {{--conf k=v}} arguments. But if 
{{spark-defaults.conf}} is ignored, the behaviour ends up being unexpected.

All of this is caused by the code below:
{code}
  private Properties loadPropertiesFile() throws IOException {
Properties props = new Properties();
File propsFile;
if (propertiesFile != null) {
  // the default conf properties file is not loaded when the app developer
  // passes --properties-file as a submit arg
  propsFile = new File(propertiesFile);
  checkArgument(propsFile.isFile(), "Invalid properties file '%s'.", 
propertiesFile);
} else {
  propsFile = new File(getConfDir(), DEFAULT_PROPERTIES_FILE);
}

//...

return props;
  }
{code}

I can offer a patch to fix it if you think it makes sense.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21022) RDD.foreach swallows exceptions

2017-06-08 Thread Colin Woodbury (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Woodbury updated SPARK-21022:
---
Summary: RDD.foreach swallows exceptions  (was: foreach swallows exceptions)

> RDD.foreach swallows exceptions
> ---
>
> Key: SPARK-21022
> URL: https://issues.apache.org/jira/browse/SPARK-21022
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Colin Woodbury
>Priority: Minor
>
> A `RDD.foreach` or `RDD.foreachPartition` call will swallow Exceptions thrown 
> inside its closure, but not if the exception was thrown earlier in the call 
> chain. An example:
> {code:none}
>  package examples
>  import org.apache.spark._
>  object Shpark {
>def main(args: Array[String]) {
>  implicit val sc: SparkContext = new SparkContext(
>new SparkConf().setMaster("local[*]").setAppName("blahfoobar")
>  )
>  /* DOESN'T THROW 
> 
>  sc.parallelize(0 until 1000) 
> 
>.foreachPartition { _.map { i =>   
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}} 
> 
>   */
>  /* DOESN'T THROW, nor does anything print.   
> 
>   * Commenting out the exception runs the prints. 
> 
>   * (i.e. `foreach` is sufficient to "run" an RDD)
> 
>  sc.parallelize(0 until 10)   
> 
>.foreach({ i =>
> 
>  println("BEFORE THROW")  
> 
>  throw new Exception("Testing exception handling")
> 
>  println(i)   
> 
>}) 
> 
>   */
>  /* Throws! */
>  sc.parallelize(0 until 10)
>.map({ i =>
>  println("BEFORE THROW")
>  throw new Exception("Testing exception handling")
>  i
>})
>.foreach(i => println(i))
>  println("JOB DONE!")
>  System.in.read
>  sc.stop()
>}
>  }
> {code}
> When exceptions are swallowed, the jobs don't seem to fail, and the driver 
> exits normally. When one _is_ thrown, as in the last example, the exception 
> successfully rises up to the driver and can be caught with try/catch.
> The expected behaviour is for exceptions in `foreach` to throw and crash the 
> driver, as they would with `map`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


