[jira] [Created] (SPARK-28332) SQLMetric wrong initValue

2019-07-10 Thread Song Jun (JIRA)
Song Jun created SPARK-28332:


 Summary: SQLMetric wrong initValue 
 Key: SPARK-28332
 URL: https://issues.apache.org/jira/browse/SPARK-28332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Song Jun


Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set to 
-1.

If a ShuffleMapStage has lots of tasks that read 0 bytes of data, those tasks 
send the metric to the driver with its value still at the initial -1. The driver 
then merges the metrics for this stage in DAGScheduler.updateAccumulators, which 
causes the merged metric value of the stage to become negative.

This is incorrect; we should set the initValue to 0.

The same problem exists for SQLMetrics.createTimingMetric.
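
A self-contained sketch of the symptom (this is a toy accumulator, not Spark's 
actual SQLMetric class; it only illustrates why an initValue of -1 corrupts the 
merged total):

{code}
// Toy stand-in for SQLMetric, just to show the arithmetic of the bug.
class ToyMetric(initValue: Long) {
  private var v: Long = initValue
  def add(n: Long): Unit = v += n                   // tasks that read data call add()
  def merge(other: ToyMetric): Unit = v += other.v  // driver-side merge is a plain sum
  def value: Long = v
}

val merged = new ToyMetric(initValue = -1)
// 100 tasks read 0 bytes, so they never call add() and each still holds -1.
val taskMetrics = Seq.fill(100)(new ToyMetric(initValue = -1))
taskMetrics.foreach(merged.merge)
println(merged.value)  // -101 with initValue = -1; it would be 0 with initValue = 0
{code}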






[jira] [Comment Edited] (SPARK-27227) Spark Runtime Filter

2019-05-06 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833558#comment-16833558
 ] 

Song Jun edited comment on SPARK-27227 at 5/6/19 7:32 AM:
--

[~cloud_fan] [~smilegator] Could you please help review this SPIP? Thanks very 
much!



was (Author: windpiger):
[~cloud_fan] [~LI,Xiao] Could you please help review this SPIP? Thanks very 
much!


> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join a big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> A simple example:
> select * from A, B where A.a = B.b
> where A is a big table and B is a small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    We can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>    We can collect some runtime statistics (such as min/max/bloom filter) of 
> B.b by executing an extra query (select max(b), min(b), bbf(b) from B), and 
> send them to table A to filter on A.a.
>    Additionally, for a more complex query such as select * from A join 
> (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime 
> statistics (min/max/bloom filter) of X by executing an extra query 
> (select max(b), min(b), bbf(b) from X).
> In both scenarios we can filter out a lot of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about 35% improvement in our test.
> I will submit a SPIP later.
> SPIP: 
> https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt






[jira] [Commented] (SPARK-27227) Spark Runtime Filter

2019-05-06 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833558#comment-16833558
 ] 

Song Jun commented on SPARK-27227:
--

[~cloud_fan] [~LI,Xiao] Could you please help review this SPIP? Thanks very 
much!


> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join a big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> A simple example:
> select * from A, B where A.a = B.b
> where A is a big table and B is a small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    We can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>    We can collect some runtime statistics (such as min/max/bloom filter) of 
> B.b by executing an extra query (select max(b), min(b), bbf(b) from B), and 
> send them to table A to filter on A.a.
>    Additionally, for a more complex query such as select * from A join 
> (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime 
> statistics (min/max/bloom filter) of X by executing an extra query 
> (select max(b), min(b), bbf(b) from X).
> In both scenarios we can filter out a lot of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about 35% improvement in our test.
> I will submit a SPIP later.
> SPIP: 
> https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt






[jira] [Updated] (SPARK-27227) Spark Runtime Filter

2019-05-06 Thread Song Jun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27227:
-
Description: 
When we equi-join a big table with a smaller table, we can collect some 
statistics from the smaller table side and use them in the scan of the big table 
to do partition pruning or data filtering before executing the join.
This can significantly improve SQL performance.

A simple example:
select * from A, B where A.a = B.b
where A is a big table and B is a small table.

There are two scenarios:
1. A.a is a partition column of table A
   We can collect all the values of B.b and send them to table A to do 
   partition pruning on A.a.
2. A.a is not a partition column of table A
   We can collect some runtime statistics (such as min/max/bloom filter) of B.b 
by executing an extra query (select max(b), min(b), bbf(b) from B), and send 
them to table A to filter on A.a.
   Additionally, for a more complex query such as select * from A join (select 
* from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics 
(min/max/bloom filter) of X by executing an extra query (select max(b), min(b), 
bbf(b) from X).

In both scenarios we can filter out a lot of data by partition pruning or data 
filtering, and thus improve performance.

A 10TB TPC-DS run gained about 35% improvement in our test.

I will submit a SPIP later.

SPIP: 
https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt

  was:
When we equi-join a big table with a smaller table, we can collect some 
statistics from the smaller table side and use them in the scan of the big table 
to do partition pruning or data filtering before executing the join.
This can significantly improve SQL performance.

A simple example:
select * from A, B where A.a = B.b
where A is a big table and B is a small table.

There are two scenarios:
1. A.a is a partition column of table A
   We can collect all the values of B.b and send them to table A to do 
   partition pruning on A.a.
2. A.a is not a partition column of table A
   We can collect some runtime statistics (such as min/max/bloom filter) of B.b 
by executing an extra query (select max(b), min(b), bbf(b) from B), and send 
them to table A to filter on A.a.
   Additionally, for a more complex query such as select * from A join (select 
* from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics 
(min/max/bloom filter) of X by executing an extra query (select max(b), min(b), 
bbf(b) from X).

In both scenarios we can filter out a lot of data by partition pruning or data 
filtering, and thus improve performance.

A 10TB TPC-DS run gained about 35% improvement in our test.

I will submit a SPIP later.


> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join a big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> A simple example:
> select * from A, B where A.a = B.b
> where A is a big table and B is a small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    We can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>    We can collect some runtime statistics (such as min/max/bloom filter) of 
> B.b by executing an extra query (select max(b), min(b), bbf(b) from B), and 
> send them to table A to filter on A.a.
>    Additionally, for a more complex query such as select * from A join 
> (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime 
> statistics (min/max/bloom filter) of X by executing an extra query 
> (select max(b), min(b), bbf(b) from X).
> In both scenarios we can filter out a lot of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about 35% improvement in our test.
> I will submit a SPIP later.
> SPIP: 
> https://docs.google.com/document/d/1hTXxsG_qLu5W_VrVvPx2gumXEFrnVPhQSRMjZ6WkfiY/edit#heading=h.7vhjx9226jbt






[jira] [Updated] (SPARK-27227) Spark Runtime Filter

2019-04-28 Thread Song Jun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27227:
-
Summary: Spark Runtime Filter  (was: Dynamic Partition Prune in Spark)

> Spark Runtime Filter
> 
>
> Key: SPARK-27227
> URL: https://issues.apache.org/jira/browse/SPARK-27227
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> When we equi-join a big table with a smaller table, we can collect some 
> statistics from the smaller table side and use them in the scan of the big 
> table to do partition pruning or data filtering before executing the join.
> This can significantly improve SQL performance.
> A simple example:
> select * from A, B where A.a = B.b
> where A is a big table and B is a small table.
> There are two scenarios:
> 1. A.a is a partition column of table A
>    We can collect all the values of B.b and send them to table A to do 
>    partition pruning on A.a.
> 2. A.a is not a partition column of table A
>    We can collect some runtime statistics (such as min/max/bloom filter) of 
> B.b by executing an extra query (select max(b), min(b), bbf(b) from B), and 
> send them to table A to filter on A.a.
>    Additionally, for a more complex query such as select * from A join 
> (select * from B where B.c = 1) X on A.a = B.b, we collect the runtime 
> statistics (min/max/bloom filter) of X by executing an extra query 
> (select max(b), min(b), bbf(b) from X).
> In both scenarios we can filter out a lot of data by partition pruning or 
> data filtering, and thus improve performance.
> A 10TB TPC-DS run gained about 35% improvement in our test.
> I will submit a SPIP later.






[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2019-04-17 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819844#comment-16819844
 ] 

Song Jun commented on SPARK-19842:
--

I think constraint support should be designed together with DataSource v2, and 
it can cover more than this JIRA.

Constraints can be used for:
1. data integrity (not included in this JIRA)
2. query rewriting in the optimizer to gain performance (not just PK/FK; 
unique/not null are also useful)

For data integrity, we have two scenarios:
1.1 The DataSource natively supports data integrity, such as MySQL/Oracle and so on.
    Spark should only call the read/write API of this DataSource and do nothing 
about data integrity.
1.2 The DataSource does not support data integrity, such as csv/json/parquet and so on.
    Spark can provide data integrity for this DataSource like Hive does (maybe 
with a switch to turn it off), and we can discuss which kinds of constraints to 
support.
    For example, Hive supports PK/FK/UNIQUE (DISABLE RELY)/NOT NULL/DEFAULT; the 
NOT NULL ENFORCE check is implemented by adding an extra UDF 
GenericUDFEnforceNotNullConstraint to the plan 
(https://issues.apache.org/jira/browse/HIVE-16605).

For query rewriting in the optimizer:
2.1 We can add constraint information to CatalogTable, which is returned by the 
catalog.getTable API. Then the optimizer can use it to rewrite queries.
2.2 If we cannot get constraint information, we can use a hint in the SQL.

Given the above, we can bring the constraint feature into the DataSource v2 design:
a) To support feature 2.1, we can add constraint information to the 
createTable/alterTable/getTable APIs in this 
SPIP(https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#)
b) To support data integrity, we can add a ConstraintSupport mix-in for DataSource 
v2:
  if a DataSource supports constraints, Spark does nothing when inserting data;
  if a DataSource does not support constraints but still wants constraint checks, 
Spark should do the check like Hive does (e.g. NOT NULL in Hive adds an extra UDF 
GenericUDFEnforceNotNullConstraint to the plan);
  if a DataSource does not support constraints and does not want constraint 
checks, Spark does nothing.
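
A rough sketch of what such a mix-in could look like (hypothetical only; none of 
these types exist in Spark today, and the real shape would be decided in the 
DataSource v2 design):

{code}
// Hypothetical sketch of a ConstraintSupport mix-in for DataSource v2.
sealed trait Constraint
case class NotNull(column: String) extends Constraint
case class PrimaryKey(columns: Seq[String]) extends Constraint

trait ConstraintSupport {
  // Constraints declared on the table, e.g. from CREATE TABLE ... NOT NULL DISABLE.
  def constraints: Seq[Constraint]

  // true  -> the source enforces them itself (case 1.1), Spark does nothing on insert;
  // false -> Spark adds a check to the write plan before inserting data (case 1.2),
  //          similar to Hive's GenericUDFEnforceNotNullConstraint.
  def enforcesConstraints: Boolean
}
{code}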


The Hive catalog supports constraints, so we can implement this logic in the 
createTable/alterTable APIs. Then we can use Spark SQL DDL to create a table with 
constraints, which are stored in the Hive metastore through the Hive catalog API.
For example: CREATE TABLE t(a STRING, b STRING NOT NULL DISABLE, CONSTRAINT pk1 
PRIMARY KEY (a) DISABLE) USING parquet;

As for how to store the constraints: because Hive 2.1 provides constraint APIs in 
Hive.java, we can call them directly in the createTable/alterTable APIs of the 
Hive catalog. There is no need for Spark to store the constraint information in 
table properties. There are some concerns about using the Hive 2.1 catalog APIs 
directly in the docs 
(https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit#heading=h.lnxbz9),
 such as Spark's built-in Hive being 1.2.1, but upgrading Hive to 2.3.4 is in 
progress (https://issues.apache.org/jira/browse/SPARK-23710).

[~cloud_fan] [~ioana-delaney]
If this proposal is reasonable, please give me some feedback. Thanks!

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint 

[jira] [Created] (SPARK-27280) infer filters from Join's OR condition

2019-03-26 Thread Song Jun (JIRA)
Song Jun created SPARK-27280:


 Summary: infer filters from Join's OR condition
 Key: SPARK-27280
 URL: https://issues.apache.org/jira/browse/SPARK-27280
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, SQL
Affects Versions: 3.0.0
Reporter: Song Jun


In some cases, we can infer filters from a join condition that contains OR 
expressions.

For example, TPC-DS query 48:

{code:java}
select sum (ss_quantity)
 from store_sales, store, customer_demographics, customer_address, date_dim
 where s_store_sk = ss_store_sk
 and  ss_sold_date_sk = d_date_sk and d_year = 2000
 and  
 (
  (
   cd_demo_sk = ss_cdemo_sk
   and 
   cd_marital_status = 'S'
   and 
   cd_education_status = 'Secondary'
   and 
   ss_sales_price between 100.00 and 150.00  
   )
 or
  (
  cd_demo_sk = ss_cdemo_sk
   and 
   cd_marital_status = 'M'
   and 
   cd_education_status = 'College'
   and 
   ss_sales_price between 50.00 and 100.00   
  )
 or 
 (
  cd_demo_sk = ss_cdemo_sk
  and 
   cd_marital_status = 'U'
   and 
   cd_education_status = '2 yr Degree'
   and 
   ss_sales_price between 150.00 and 200.00  
 )
 )
 and
 (
  (
  ss_addr_sk = ca_address_sk
  and
  ca_country = 'United States'
  and
  ca_state in ('AL', 'OH', 'MD')
  and ss_net_profit between 0 and 2000  
  )
 or
  (ss_addr_sk = ca_address_sk
  and
  ca_country = 'United States'
  and
  ca_state in ('VA', 'TX', 'IA')
  and ss_net_profit between 150 and 3000 
  )
 or
  (ss_addr_sk = ca_address_sk
  and
  ca_country = 'United States'
  and
  ca_state in ('RI', 'WI', 'KY')
  and ss_net_profit between 50 and 25000 
  )
 )
;
{code}

we can infer two filters from the join's OR condition:

{code:java}
for customer_demographics:
cd_marital_status in ('S', 'M', 'U') and cd_education_status in ('Secondary', 
'College', '2 yr Degree')

for store_sales:
 (ss_sales_price between 100.00 and 150.00 or ss_sales_price between 50.00 and 
100.00 or ss_sales_price between 150.00 and 200.00)
{code}

Then we can push down these two filters to the scans of customer_demographics 
and store_sales.
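
A minimal sketch of the inference itself (not the actual Optimizer rule): when 
every OR branch contains a predicate on the same table, the disjunction of those 
per-branch predicates is implied by the whole condition and can be pushed to that 
table's scan.

{code}
// Toy illustration: (J AND X1) OR (J AND X2) OR (J AND X3) implies (X1 OR X2 OR X3).
case class Branch(joinCond: String, perTableFilter: String)

def inferredFilter(branches: Seq[Branch]): Option[String] =
  if (branches.isEmpty) None
  else Some(branches.map(b => s"(${b.perTableFilter})").mkString(" OR "))

// The store_sales branches of TPC-DS query 48:
val storeSales = Seq(
  Branch("cd_demo_sk = ss_cdemo_sk", "ss_sales_price between 100.00 and 150.00"),
  Branch("cd_demo_sk = ss_cdemo_sk", "ss_sales_price between 50.00 and 100.00"),
  Branch("cd_demo_sk = ss_cdemo_sk", "ss_sales_price between 150.00 and 200.00"))
println(inferredFilter(storeSales).get)  // the pushable store_sales filter shown above
{code}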

A PR will be submitted soon.









[jira] [Updated] (SPARK-27229) GroupBy Placement in Intersect Distinct

2019-03-22 Thread Song Jun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27229:
-
Priority: Major  (was: Minor)

> GroupBy Placement in Intersect Distinct
> ---
>
> Key: SPARK-27229
> URL: https://issues.apache.org/jira/browse/SPARK-27229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> The Intersect operator is replaced by a Left Semi Join in the Optimizer.
> For example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
>  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
> a2<=>b2
> If Tab1 and Tab2 are very large, the join will be very slow. We can reduce 
> the table data before the join by placing a group-by operator under the 
> join, that is:
> ==>  
> SELECT a1, a2 FROM 
>    (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
>    LEFT SEMI JOIN 
>    (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> Then the join runs on much smaller table data, because the group by has 
> already removed a lot of duplicate rows.
> A PR will be submitted soon.






[jira] [Updated] (SPARK-27229) GroupBy Placement in Intersect Distinct

2019-03-22 Thread Song Jun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27229:
-
Priority: Minor  (was: Major)

> GroupBy Placement in Intersect Distinct
> ---
>
> Key: SPARK-27229
> URL: https://issues.apache.org/jira/browse/SPARK-27229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Minor
>
> The Intersect operator is replaced by a Left Semi Join in the Optimizer.
> For example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
>  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
> a2<=>b2
> If Tab1 and Tab2 are very large, the join will be very slow. We can reduce 
> the table data before the join by placing a group-by operator under the 
> join, that is:
> ==>  
> SELECT a1, a2 FROM 
>    (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
>    LEFT SEMI JOIN 
>    (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> Then the join runs on much smaller table data, because the group by has 
> already removed a lot of duplicate rows.
> A PR will be submitted soon.






[jira] [Commented] (SPARK-27229) GroupBy Placement in Intersect Distinct

2019-03-22 Thread Song Jun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798748#comment-16798748
 ] 

Song Jun commented on SPARK-27229:
--

Thanks

> GroupBy Placement in Intersect Distinct
> ---
>
> Key: SPARK-27229
> URL: https://issues.apache.org/jira/browse/SPARK-27229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> The Intersect operator is replaced by a Left Semi Join in the Optimizer.
> For example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
>  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
> a2<=>b2
> If Tab1 and Tab2 are very large, the join will be very slow. We can reduce 
> the table data before the join by placing a group-by operator under the 
> join, that is:
> ==>  
> SELECT a1, a2 FROM 
>    (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
>    LEFT SEMI JOIN 
>    (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> Then the join runs on much smaller table data, because the group by has 
> already removed a lot of duplicate rows.
> A PR will be submitted soon.






[jira] [Updated] (SPARK-27229) GroupBy Placement in Intersect Distinct

2019-03-21 Thread Song Jun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-27229:
-
Description: 
The Intersect operator is replaced by a Left Semi Join in the Optimizer.

For example:
SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
 ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
a2<=>b2

If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the 
table data before the join by placing a group-by operator under the join, that 
is:

==>  
SELECT a1, a2 FROM 
   (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
   LEFT SEMI JOIN 
   (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
ON X.a1<=>Y.b1 AND X.a2<=>Y.b2

Then the join runs on much smaller table data, because the group by has already 
removed a lot of duplicate rows.

A PR will be submitted soon.



  was:
The Intersect operator is replaced by a Left Semi Join in the Optimizer.

For example:
SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
 ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
a2<=>b2

If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the 
table data before the join by placing a group-by operator under the join, that 
is:

==>  
SELECT a1, a2 FROM 
   (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
   LEFT SEMI JOIN 
   (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
ON X.a1<=>Y.b1 AND X.a2<=>Y.b2

Then the join runs on much smaller table data, because the group by has already 
removed a lot of duplicate rows.





> GroupBy Placement in Intersect Distinct
> ---
>
> Key: SPARK-27229
> URL: https://issues.apache.org/jira/browse/SPARK-27229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Critical
>
> The Intersect operator is replaced by a Left Semi Join in the Optimizer.
> For example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
>  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
> a2<=>b2
> If Tab1 and Tab2 are very large, the join will be very slow. We can reduce 
> the table data before the join by placing a group-by operator under the 
> join, that is:
> ==>  
> SELECT a1, a2 FROM 
>    (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
>    LEFT SEMI JOIN 
>    (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> Then the join runs on much smaller table data, because the group by has 
> already removed a lot of duplicate rows.
> A PR will be submitted soon.






[jira] [Created] (SPARK-27229) GroupBy Placement in Intersect Distinct

2019-03-21 Thread Song Jun (JIRA)
Song Jun created SPARK-27229:


 Summary: GroupBy Placement in Intersect Distinct
 Key: SPARK-27229
 URL: https://issues.apache.org/jira/browse/SPARK-27229
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Song Jun


The Intersect operator is replaced by a Left Semi Join in the Optimizer.

For example:
SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
 ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
a2<=>b2

If Tab1 and Tab2 are very large, the join will be very slow. We can reduce the 
table data before the join by placing a group-by operator under the join, that 
is:

==>  
SELECT a1, a2 FROM 
   (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
   LEFT SEMI JOIN 
   (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
ON X.a1<=>Y.b1 AND X.a2<=>Y.b2

Then the join runs on much smaller table data, because the group by has already 
removed a lot of duplicate rows.
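
A quick spark-shell sanity check of the rewrite with toy data (a sketch only; it 
is not part of the proposed optimizer rule):

{code}
import spark.implicits._

Seq((1, 1), (1, 1), (2, 2)).toDF("a1", "a2").createOrReplaceTempView("Tab1")
Seq((1, 1), (3, 3)).toDF("b1", "b2").createOrReplaceTempView("Tab2")

val intersectDf = spark.sql(
  "SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2")

val rewrittenDf = spark.sql(
  """SELECT a1, a2 FROM
    |   (SELECT a1, a2 FROM Tab1 GROUP BY a1, a2) X
    |   LEFT SEMI JOIN
    |   (SELECT b1, b2 FROM Tab2 GROUP BY b1, b2) Y
    |ON X.a1 <=> Y.b1 AND X.a2 <=> Y.b2""".stripMargin)

// Both plans should return the single row (1, 1).
assert(intersectDf.collect().toSet == rewrittenDf.collect().toSet)
{code}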









[jira] [Created] (SPARK-27227) Dynamic Partition Prune in Spark

2019-03-21 Thread Song Jun (JIRA)
Song Jun created SPARK-27227:


 Summary: Dynamic Partition Prune in Spark
 Key: SPARK-27227
 URL: https://issues.apache.org/jira/browse/SPARK-27227
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Song Jun


When we equi-join a big table with a smaller table, we can collect some 
statistics from the smaller table side and use them in the scan of the big table 
to do partition pruning or data filtering before executing the join.
This can significantly improve SQL performance.

A simple example:
select * from A, B where A.a = B.b
where A is a big table and B is a small table.

There are two scenarios:
1. A.a is a partition column of table A
   We can collect all the values of B.b and send them to table A to do 
   partition pruning on A.a.
2. A.a is not a partition column of table A
   We can collect some runtime statistics (such as min/max/bloom filter) of B.b 
by executing an extra query (select max(b), min(b), bbf(b) from B), and send 
them to table A to filter on A.a.
   Additionally, for a more complex query such as select * from A join (select 
* from B where B.c = 1) X on A.a = B.b, we collect the runtime statistics 
(min/max/bloom filter) of X by executing an extra query (select max(b), min(b), 
bbf(b) from X).

In both scenarios we can filter out a lot of data by partition pruning or data 
filtering, and thus improve performance.

A 10TB TPC-DS run gained about 35% improvement in our test.

I will submit a SPIP later.
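
A hand-written sketch of scenario 2 (illustration only, with the assumed table 
names A and B and numeric join keys; the SPIP would do this automatically inside 
the planner):

{code}
// Step 1: run a small extra query on the smaller table B to collect runtime stats.
val stats = spark.sql("SELECT min(b) AS minB, max(b) AS maxB FROM B").collect().head
val (minB, maxB) = (stats.get(0), stats.get(1))

// Step 2: apply those stats as an extra filter on the big table A before the join,
// so most of A's data (or partitions, when A.a is a partition column) is skipped.
// A Bloom filter on B.b could prune even more rows than the min/max range.
val joined = spark.sql(
  s"SELECT * FROM A JOIN B ON A.a = B.b WHERE A.a BETWEEN $minB AND $maxB")
{code}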






[jira] [Updated] (SPARK-20960) make ColumnVector public

2017-07-20 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-20960:
-
Description: 
ColumnVector is an internal interface in Spark SQL, which is only used for 
vectorized parquet reader to represent the in-memory columnar format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.

  was:
_emphasized text_ColumnVector is an internal interface in Spark SQL, which is 
only used for vectorized parquet reader to represent the in-memory columnar 
format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.


> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> ColumnVector is an internal interface in Spark SQL, which is only used for 
> vectorized parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow(basically just a wrapper over Arrow), so that external systems(like 
> Python Pandas DataFrame) can build ColumnVector very easily.






[jira] [Updated] (SPARK-20960) make ColumnVector public

2017-07-20 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-20960:
-
Description: 
_emphasized text_ColumnVector is an internal interface in Spark SQL, which is 
only used for vectorized parquet reader to represent the in-memory columnar 
format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.

  was:
ColumnVector is an internal interface in Spark SQL, which is only used for 
vectorized parquet reader to represent the in-memory columnar format.

In Spark 2.3 we want to make ColumnVector public, so that we can provide a more 
efficient way for data exchanges between Spark and external systems. For 
example, we can use ColumnVector to build the columnar read API in data source 
framework, we can use ColumnVector to build a more efficient UDF API, etc.

We also want to introduce a new ColumnVector implementation based on Apache 
Arrow(basically just a wrapper over Arrow), so that external systems(like 
Python Pandas DataFrame) can build ColumnVector very easily.


> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>
> _emphasized text_ColumnVector is an internal interface in Spark SQL, which is 
> only used for vectorized parquet reader to represent the in-memory columnar 
> format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way for data exchanges between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in data 
> source framework, we can use ColumnVector to build a more efficient UDF API, 
> etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow(basically just a wrapper over Arrow), so that external systems(like 
> Python Pandas DataFrame) can build ColumnVector very easily.






[jira] [Updated] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog

2017-03-18 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-20013:
-
Description: 
Currently, when we create or rename a managed table, we need to compute its 
defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated 
in its two subclasses, HiveExternalCatalog and InMemoryCatalog. There is also a 
defaultTablePath in SessionCatalog, so we currently have three defaultTablePath 
implementations in three classes.
We had better unify them in SessionCatalog.

To unify them, we should move some logic from ExternalCatalog to 
SessionCatalog; renameTable is one such case.

However, renameTable takes only these simple parameters:
{code}
  def renameTable(db: String, oldName: String, newName: String): Unit
{code}
so even if we move the defaultTablePath logic to SessionCatalog, we cannot pass 
the new path to renameTable.

So we can add a newTablePath parameter to renameTable in ExternalCatalog.
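
One possible shape of the extended method (a hypothetical sketch only; the actual 
parameter name and type would be settled in the PR):

{code}
import java.net.URI

trait ExternalCatalogRenameSketch {
  // newTablePath carries the pre-computed default location for managed tables,
  // so ExternalCatalog implementations no longer need their own defaultTablePath logic.
  def renameTable(db: String, oldName: String, newName: String, newTablePath: URI): Unit
}
{code}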

  was:
There are several reasons to merge renameTable into alterTable in ExternalCatalog:
1. In Hive, we rename a table via alterTable.
2. Currently, when we create or rename a managed table, we need to compute its 
defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated 
in its two subclasses, HiveExternalCatalog and InMemoryCatalog. There is also a 
defaultTablePath in SessionCatalog, so we currently have three defaultTablePath 
implementations in three classes.
We had better unify them in SessionCatalog.

To unify them, we should move some logic from ExternalCatalog to 
SessionCatalog; renameTable is one such case.

However, renameTable takes only these simple parameters:
{code}
  def renameTable(db: String, oldName: String, newName: String): Unit
{code}
so even if we move the defaultTablePath logic to SessionCatalog, we cannot pass 
the new path to renameTable.

So we can merge renameTable into alterTable, and do the rename inside alterTable.


> merge renameTable to alterTable in ExternalCatalog
> --
>
> Key: SPARK-20013
> URL: https://issues.apache.org/jira/browse/SPARK-20013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently, when we create or rename a managed table, we need to compute its 
> defaultTablePath in ExternalCatalog, so the defaultTablePath logic is 
> duplicated in its two subclasses, HiveExternalCatalog and InMemoryCatalog. 
> There is also a defaultTablePath in SessionCatalog, so we currently have 
> three defaultTablePath implementations in three classes.
> We had better unify them in SessionCatalog.
> To unify them, we should move some logic from ExternalCatalog to 
> SessionCatalog; renameTable is one such case.
> However, renameTable takes only these simple parameters:
> {code}
>   def renameTable(db: String, oldName: String, newName: String): Unit
> {code}
> so even if we move the defaultTablePath logic to SessionCatalog, we cannot 
> pass the new path to renameTable.
> So we can add a newTablePath parameter to renameTable in ExternalCatalog.






[jira] [Created] (SPARK-20013) merge renameTable to alterTable in ExternalCatalog

2017-03-18 Thread Song Jun (JIRA)
Song Jun created SPARK-20013:


 Summary: merge renameTable to alterTable in ExternalCatalog
 Key: SPARK-20013
 URL: https://issues.apache.org/jira/browse/SPARK-20013
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


There are several reasons to merge renameTable into alterTable in ExternalCatalog:
1. In Hive, we rename a table via alterTable.
2. Currently, when we create or rename a managed table, we need to compute its 
defaultTablePath in ExternalCatalog, so the defaultTablePath logic is duplicated 
in its two subclasses, HiveExternalCatalog and InMemoryCatalog. There is also a 
defaultTablePath in SessionCatalog, so we currently have three defaultTablePath 
implementations in three classes.
We had better unify them in SessionCatalog.

To unify them, we should move some logic from ExternalCatalog to 
SessionCatalog; renameTable is one such case.

However, renameTable takes only these simple parameters:
{code}
  def renameTable(db: String, oldName: String, newName: String): Unit
{code}
so even if we move the defaultTablePath logic to SessionCatalog, we cannot pass 
the new path to renameTable.

So we can merge renameTable into alterTable, and do the rename inside alterTable.






[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun edited comment on SPARK-19990 at 3/17/17 4:36 AM:
---

The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

The cars.csv file is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite in SPARK-19235 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
testing the hive module also runs DDLSuite from the core module, which is how we 
end up with the illegal path like 'jar:file:/xxx' above.

It is not related to SPARK-19763.

I will fix this by providing a new test directory under sql/ that contains the 
test files, and making the test case use that path.

thanks~



was (Author: windpiger):
The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

The cars.csv file is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
testing the hive module also runs DDLSuite from the core module, which is how we 
end up with the illegal path like 'jar:file:/xxx' above.

It is not related to SPARK-19763.

I will fix this by providing a new test directory under sql/ that contains the 
test files, and making the test case use that path.

thanks~


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> 

[jira] [Comment Edited] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun edited comment on SPARK-19990 at 3/17/17 4:35 AM:
---

The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

The cars.csv file is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
testing the hive module also runs DDLSuite from the core module, which is how we 
end up with the illegal path like 'jar:file:/xxx' above.

It is not related to SPARK-19763.

I will fix this by providing a new test directory under sql/ that contains the 
test files, and making the test case use that path.

thanks~



was (Author: windpiger):
The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

The cars.csv file is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
testing the hive module also runs DDLSuite from the core module, which is how we 
end up with the illegal path like 'jar:file:/xxx' above.


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> 

[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using

2017-03-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15929409#comment-15929409
 ] 

Song Jun commented on SPARK-19990:
--

The root cause is that [the csv file path in this test 
case|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L703]
 is 
"jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv",
 which fails in new Path() ([new Path in datasource.scala 
|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L344]).

The cars.csv file is stored in the core module's resources.

After we merged HiveDDLSuite and DDLSuite 
(https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9), 
testing the hive module also runs DDLSuite from the core module, which is how we 
end up with the illegal path like 'jar:file:/xxx' above.
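
A minimal reproduction of the failure outside the test suite (a sketch; it 
assumes Hadoop's Path behaves as shown in the stack trace above):

{code}
import org.apache.hadoop.fs.Path

// A "jar:file:...!/..." URL cannot be turned into a Hadoop Path: Path.initialize
// throws IllegalArgumentException ("Relative path in absolute URI"), which is
// exactly what DataSource.resolveRelation hits when the csv resource lives inside
// the tests jar instead of on a plain file system.
val jarUrl = "jar:file:/tmp/spark-sql-tests.jar!/test-data/cars.csv"
try {
  new Path(jarUrl)
} catch {
  case e: IllegalArgumentException => println(s"as expected: ${e.getMessage}")
}
{code}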


> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create 
> temporary view using
> --
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.2.0
>Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: 
> https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite_name=create+temporary+view+using
>  and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure: 
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: 
> jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.(Path.java:172)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at 
> org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
>   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at 
> 

[jira] [Created] (SPARK-19961) unify a exception erro msg for dropdatabase

2017-03-15 Thread Song Jun (JIRA)
Song Jun created SPARK-19961:


 Summary: unify a exception erro msg for dropdatabase
 Key: SPARK-19961
 URL: https://issues.apache.org/jira/browse/SPARK-19961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Unify the exception error message for dropDatabase when the database still 
contains some tables.






[jira] [Created] (SPARK-19945) Add test case for SessionCatalog with HiveExternalCatalog

2017-03-14 Thread Song Jun (JIRA)
Song Jun created SPARK-19945:


 Summary: Add test case for SessionCatalog with HiveExternalCatalog
 Key: SPARK-19945
 URL: https://issues.apache.org/jira/browse/SPARK-19945
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite for 
HiveExternalCatalog.
Also, some DDL functions are not well suited to testing in ExternalCatalogSuite, 
because their logic is not fully implemented in ExternalCatalog. These DDL 
functions are fully implemented in SessionCatalog, so it is better to test them 
in SessionCatalogSuite.

So we should add a test suite for SessionCatalog backed by HiveExternalCatalog.






[jira] [Updated] (SPARK-19945) Add test suite for SessionCatalog with HiveExternalCatalog

2017-03-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19945:
-
Summary: Add test suite for SessionCatalog with HiveExternalCatalog  (was: 
Add test case for SessionCatalog with HiveExternalCatalog)

> Add test suite for SessionCatalog with HiveExternalCatalog
> --
>
> Key: SPARK-19945
> URL: https://issues.apache.org/jira/browse/SPARK-19945
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently SessionCatalogSuite only covers InMemoryCatalog; there is no suite 
> for HiveExternalCatalog.
> Also, some DDL functions are not well suited to testing in 
> ExternalCatalogSuite, because their logic is not fully implemented in 
> ExternalCatalog. These DDL functions are fully implemented in SessionCatalog, 
> so it is better to test them in SessionCatalogSuite.
> So we should add a test suite for SessionCatalog backed by HiveExternalCatalog.






[jira] [Created] (SPARK-19917) qualified partition location stored in catalog

2017-03-10 Thread Song Jun (JIRA)
Song Jun created SPARK-19917:


 Summary: qualified partition location stored in catalog
 Key: SPARK-19917
 URL: https://issues.apache.org/jira/browse/SPARK-19917
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


The partition path should be qualified before it is stored in the catalog.
There are several scenarios:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x'
   qualified: file:/path/x
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x'
   qualified: file:/tablelocation/x
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
   qualified: file:/path/x
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
   qualified: file:/tablelocation/x

Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde tables stores
the expected qualified path; we should make the other scenarios consistent with it.





--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19869) move table related ddl from ddl.scala to tables.scala

2017-03-08 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19869:
-
Issue Type: Improvement  (was: Bug)

> move table related ddl from ddl.scala to tables.scala
> -
>
> Key: SPARK-19869
> URL: https://issues.apache.org/jira/browse/SPARK-19869
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> move table related ddl from ddl.scala to tables.scala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19869) move table related ddl from ddl.scala to tables.scala

2017-03-08 Thread Song Jun (JIRA)
Song Jun created SPARK-19869:


 Summary: move table related ddl from ddl.scala to tables.scala
 Key: SPARK-19869
 URL: https://issues.apache.org/jira/browse/SPARK-19869
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


move table related ddl from ddl.scala to tables.scala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19864) add makeQualifiedPath in SQLTestUtils to optimize some code

2017-03-08 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19864:
-
Summary: add makeQualifiedPath in SQLTestUtils to optimize some code  (was: 
add makeQualifiedPath in CatalogUtils to optimize some code)

> add makeQualifiedPath in SQLTestUtils to optimize some code
> ---
>
> Key: SPARK-19864
> URL: https://issues.apache.org/jira/browse/SPARK-19864
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> Currently there are lots of places to make the path qualified, it is better 
> to provide a function to do this, then the code will be more simple.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19867) merge defaultTablePath logic when create table for InMemroyCatalog and HiveExternalCatalog

2017-03-08 Thread Song Jun (JIRA)
Song Jun created SPARK-19867:


 Summary: merge defaultTablePath logic when create table for 
InMemroyCatalog and HiveExternalCatalog
 Key: SPARK-19867
 URL: https://issues.apache.org/jira/browse/SPARK-19867
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we create a managed table, we set a defaultTablePath for this table.
Currently the defaultTablePath logic exists in both InMemoryCatalog and
HiveExternalCatalog; it is better to merge it up into SessionCatalog.
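
A minimal sketch of the merged helper (hypothetical placement in SessionCatalog; the signature is illustrative only):

{code}
import java.net.URI
import org.apache.hadoop.fs.Path

// A managed table's default location is a subdirectory of its database location,
// e.g. file:/warehouse/db1.db + "t1" -> file:/warehouse/db1.db/t1
def defaultTablePath(dbLocation: URI, tableName: String): URI =
  new Path(new Path(dbLocation), tableName).toUri
{code}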



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19864) add makeQualifiedPath in CatalogUtils to optimize some code

2017-03-07 Thread Song Jun (JIRA)
Song Jun created SPARK-19864:


 Summary: add makeQualifiedPath in CatalogUtils to optimize some 
code
 Key: SPARK-19864
 URL: https://issues.apache.org/jira/browse/SPARK-19864
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently there are lots of places that make a path qualified; it is better to
provide a single helper function for this, so the code becomes simpler.
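
A minimal sketch of the proposed helper (the exact signature and placement are illustrative, not the final API):

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Turn a raw path string into a fully qualified URI, e.g. "/path/xxx" -> file:/path/xxx
def makeQualifiedPath(path: String, hadoopConf: Configuration = new Configuration()): URI = {
  val hadoopPath = new Path(path)
  val fs = hadoopPath.getFileSystem(hadoopConf)
  fs.makeQualified(hadoopPath).toUri
}
{code}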



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19836) Customizable remote repository url for hive versions unit test

2017-03-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899292#comment-15899292
 ] 

Song Jun commented on SPARK-19836:
--

I have done something similar here: https://github.com/apache/spark/pull/16803
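
For reference, a minimal sketch of the kind of hook being proposed, adapting the snippet quoted below (the environment variable name comes from the proposal; the wiring is illustrative only):

{code}
// Hypothetical: let an env var supply extra remote repositories for the
// VersionsSuite artifact downloads, falling back to the currently hardcoded one.
val extraRepos: Option[String] = sys.env.get("SPARK_VERSIONS_SUITE_IVY_REPOSITORIES")
val classpath = quietly {
  SparkSubmitUtils.resolveMavenCoordinates(
    hiveArtifacts.mkString(","),
    SparkSubmitUtils.buildIvySettings(
      extraRepos.orElse(Some("http://www.datanucleus.org/downloads/maven2")),
      ivyPath),
    exclusions = version.exclusions)
}
{code}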

> Customizable remote repository url for hive versions unit test
> --
>
> Key: SPARK-19836
> URL: https://issues.apache.org/jira/browse/SPARK-19836
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Elek, Marton
>  Labels: ivy, unittest
>
> When the VersionSuite test runs from sql/hive it downloads different versions 
> from hive.
> Unfortunately the IsolatedClientClassloader (which is used by the 
> VersionSuite) uses hardcoded fix repositories:
> {code}
> val classpath = quietly {
>   SparkSubmitUtils.resolveMavenCoordinates(
> hiveArtifacts.mkString(","),
> SparkSubmitUtils.buildIvySettings(
>   Some("http://www.datanucleus.org/downloads/maven2;),
>   ivyPath),
> exclusions = version.exclusions)
> }
> {code}
> The problem is with the hard-coded repositories:
>  1. it's hard to run unit tests in an environment where only one internal 
> maven repository is available (and central/datanucleus is not)
>  2. it's impossible to run unit tests against custom built hive/hadoop 
> artifacts (which are not available from the central repository)
> VersionSuite has already a specific SPARK_VERSIONS_SUITE_IVY_PATH environment 
> variable to define a custom local repository as ivy cache.
> I suggest to add an additional environment variable 
> (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES to the HiveClientBuilder.scala), to 
> make it possible adding new remote repositories for testing the different 
> hive versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626
 ] 

Song Jun edited comment on SPARK-19845 at 3/7/17 2:33 AM:
--

Yes, https://issues.apache.org/jira/browse/SPARK-19784 is covering this. Once the
location has changed, it becomes more complex to uncache the table and to recache the
other tables that reference it.

I will dig into it more.


was (Author: windpiger):
yes, this jira is doing this https://issues.apache.org/jira/browse/SPARK-19784
 the location changed, it make it more complex to uncache the table and recache 
other tables reference this.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently if we first cache a datasource table, then we alter the table 
> location,
> then we drop the table, uncache table will failed in the DropTableCommand, 
> because the location has changed and sameResult for two InMemoryFileIndex 
> with different location return false, so we can't find the table key in the 
> cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626
 ] 

Song Jun commented on SPARK-19845:
--

Yes, https://issues.apache.org/jira/browse/SPARK-19784 is covering this. Once the
location has changed, it becomes more complex to uncache the table and to recache the
other tables that reference it.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently if we first cache a datasource table, then we alter the table 
> location,
> then we drop the table, uncache table will failed in the DropTableCommand, 
> because the location has changed and sameResult for two InMemoryFileIndex 
> with different location return false, so we can't find the table key in the 
> cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19845:


 Summary: failed to uncache datasource table after the table 
location altered
 Key: SPARK-19845
 URL: https://issues.apache.org/jira/browse/SPARK-19845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, if we first cache a datasource table, then alter the table location,
and then drop the table, uncaching the table fails in DropTableCommand:
because the location has changed, sameResult for two InMemoryFileIndex instances with
different locations returns false, so we cannot find the table key in the cache.
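
A hedged repro sketch (paths and names are hypothetical):

{code}
CREATE TABLE t(a INT) USING parquet LOCATION '/tmp/t_old';
CACHE TABLE t;
ALTER TABLE t SET LOCATION '/tmp/t_new';
DROP TABLE t;   -- uncaching inside DropTableCommand cannot find the cached plan,
                -- because the relation now resolves against the new location
{code}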



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19833) remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always return empty when the path does not exists

2017-03-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19833:


 Summary: remove SQLConf.HIVE_VERIFY_PARTITION_PATH, we always 
return empty when the path does not exists
 Key: SPARK-19833
 URL: https://issues.apache.org/jira/browse/SPARK-19833
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


In SPARK-5068 we introduced the SQLConf spark.sql.hive.verifyPartitionPath;
if it is set to true, it avoids task failures when the partition location does not
exist in the filesystem.

This situation should always return an empty result rather than fail the task,
so here we remove this conf.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-05 Thread Song Jun (JIRA)
Song Jun created SPARK-19832:


 Summary: DynamicPartitionWriteTask should escape the partition 
name 
 Key: SPARK-19832
 URL: https://issues.apache.org/jira/browse/SPARK-19832
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently in DynamicPartitionWriteTask, when we compute the partitionPath of a
partition, we only escape the partition value, not the partition name.

This causes problems for some special partition names, for example:
1) if the partition name contains '%' etc., two partition paths are created in the
filesystem: one escaped, like '/path/a%25b=1', and one unescaped, like '/path/a%b=1'.
The inserted data is stored under the unescaped path, while SHOW PARTITIONS returns
'a%25b=1' with the escaped name, so the two are inconsistent. I think the data should
be stored under the escaped path in the filesystem, which is also what Hive 2.0.0 does.

2) if the partition name contains ':', new Path("/path", "a:b") throws an exception,
because a colon in a relative path is illegal:

{code}
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path 
in absolute URI: a:b
  at org.apache.hadoop.fs.Path.initialize(Path.java:205)
  at org.apache.hadoop.fs.Path.(Path.java:171)
  at org.apache.hadoop.fs.Path.(Path.java:88)
  ... 48 elided
Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
  at java.net.URI.checkPath(URI.java:1823)
  at java.net.URI.(URI.java:745)
  at org.apache.hadoop.fs.Path.initialize(Path.java:202)
  ... 50 more
{code}
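
A minimal sketch of the intended direction, assuming the existing escapePathName helper is applied to the partition column name as well as the value (placement and naming are illustrative only):

{code}
import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName

// Escape both the partition column name and its value when building the directory
// name, e.g. ("a%b", "1") -> "a%25b=1" and ("a:b", "1") -> "a%3Ab=1"
def partitionDirName(colName: String, value: String): String =
  s"${escapePathName(colName)}=${escapePathName(value)}"
{code}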






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19784) refresh datasource table after alter the location

2017-03-01 Thread Song Jun (JIRA)
Song Jun created SPARK-19784:


 Summary: refresh datasource table after alter the location
 Key: SPARK-19784
 URL: https://issues.apache.org/jira/browse/SPARK-19784
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, if we alter the location of a datasource table and then select from
it, it still returns the data from the old location.
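
A hedged repro sketch (paths are hypothetical); today an explicit REFRESH TABLE is the workaround:

{code}
CREATE TABLE t(a INT) USING parquet LOCATION '/tmp/loc1';
INSERT INTO t VALUES (1);
ALTER TABLE t SET LOCATION '/tmp/loc2';
SELECT * FROM t;    -- still returns the rows under /tmp/loc1
REFRESH TABLE t;    -- manual workaround; the alter command should trigger this refresh itself
{code}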



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19763) qualified external datasource table location stored in catalog

2017-02-27 Thread Song Jun (JIRA)
Song Jun created SPARK-19763:


 Summary: qualified external datasource table location stored in 
catalog
 Key: SPARK-19763
 URL: https://issues.apache.org/jira/browse/SPARK-19763
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we create an external datasource table with a non-qualified location, we
should qualify it before storing it in the catalog.

{code}
CREATE TABLE t(a string)
USING parquet
LOCATION '/path/xx'


CREATE TABLE t1(a string, b string)
USING parquet
PARTITIONED BY(b)
LOCATION '/path/xx'
{code}

When we get the table from the catalog, the location should be qualified,
e.g. 'file:/path/xxx'.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19761) create InMemoryFileIndex with empty rootPaths when set PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero

2017-02-27 Thread Song Jun (JIRA)
Song Jun created SPARK-19761:


 Summary: create InMemoryFileIndex with empty rootPaths when set 
PARALLEL_PARTITION_DISCOVERY_THRESHOLD to zero
 Key: SPARK-19761
 URL: https://issues.apache.org/jira/browse/SPARK-19761
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we create an InMemoryFileIndex with empty rootPaths when
PARALLEL_PARTITION_DISCOVERY_THRESHOLD is set to zero, it throws an exception:

{code}
Positive number of slices required
java.lang.IllegalArgumentException: Positive number of slices required
at 
org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:119)
at 
org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2084)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex$.org$apache$spark$sql$execution$datasources$PartitioningAwareFileIndex$$bulkListLeafFiles(PartitioningAwareFileIndex.scala:357)
at 
org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.listLeafFiles(PartitioningAwareFileIndex.scala:256)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:74)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex.(InMemoryFileIndex.scala:50)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9$$anonfun$apply$mcV$sp$2.apply$mcV$sp(FileIndexSuite.scala:186)
at 
org.apache.spark.sql.test.SQLTestUtils$class.withSQLConf(SQLTestUtils.scala:105)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite.withSQLConf(FileIndexSuite.scala:33)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply$mcV$sp(FileIndexSuite.scala:185)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
at 
org.apache.spark.sql.execution.datasources.FileIndexSuite$$anonfun$9.apply(FileIndexSuite.scala:185)
at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
{code}
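
A minimal sketch of a possible guard (a hypothetical helper, not the actual Spark code): never ask for zero slices, and skip the distributed listing entirely when there is nothing to list:

{code}
import org.apache.hadoop.fs.Path

// Hypothetical helper: returns None when there are no paths to list (no job needed),
// otherwise a positive slice count bounded by the configured parallelism.
def safeNumSlices(paths: Seq[Path], configuredParallelism: Int): Option[Int] =
  if (paths.isEmpty) None
  else Some(math.max(1, math.min(paths.size, configuredParallelism)))
{code}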



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19742) When using SparkSession to write a dataset to Hive the schema is ignored

2017-02-27 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885399#comment-15885399
 ] 

Song Jun commented on SPARK-19742:
--

this is expected, see the comment.

{code}
/**
   * Inserts the content of the `DataFrame` to the specified table. It requires 
that
   * the schema of the `DataFrame` is the same as the schema of the table.
   *
   * @note Unlike `saveAsTable`, `insertInto` ignores the column names and just 
uses position-based
   * resolution. For example:
   *
   * {{{
   *scala> Seq((1, 2)).toDF("i", 
"j").write.mode("overwrite").saveAsTable("t1")
   *scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
   *scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
   *scala> sql("select * from t1").show
   *+---+---+
   *|  i|  j|
   *+---+---+
   *|  5|  6|
   *|  3|  4|
   *|  1|  2|
   *+---+---+
   * }}}
   *
   * Because it inserts data to an existing table, format or options will be 
ignored.
   *
   * @since 1.4.0
   */
  def insertInto(tableName: String): Unit = {

insertInto(df.sparkSession.sessionState.sqlParser.parseTableIdentifier(tableName))
  }
{code}

> When using SparkSession to write a dataset to Hive the schema is ignored
> 
>
> Key: SPARK-19742
> URL: https://issues.apache.org/jira/browse/SPARK-19742
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.1
> Environment: Running on Ubuntu with HDP 2.4.
>Reporter: Navin Goel
>
> I am saving a Dataset that is created form reading a json and some selects 
> and filters into a hive table. The dataset.write().insertInto function does 
> not look at schema when writing to the table but instead writes in order to 
> the hive table.
> The schemas for both the tables are same.
> schema printed from spark of the dataset being written:
> StructType(StructField(countrycode,StringType,true), 
> StructField(systemflag,StringType,true), 
> StructField(classcode,StringType,true), 
> StructField(classname,StringType,true), 
> StructField(rangestart,StringType,true), 
> StructField(rangeend,StringType,true), 
> StructField(tablename,StringType,true), 
> StructField(last_updated_date,TimestampType,true))
> Schema of the dataset after loading the same table from Hive:
> StructType(StructField(systemflag,StringType,true), 
> StructField(RangeEnd,StringType,true), 
> StructField(classcode,StringType,true), 
> StructField(classname,StringType,true), 
> StructField(last_updated_date,TimestampType,true), 
> StructField(countrycode,StringType,true), 
> StructField(rangestart,StringType,true), 
> StructField(tablename,StringType,true))



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19748) refresh for InMemoryFileIndex with FileStatusCache does not work correctly

2017-02-26 Thread Song Jun (JIRA)
Song Jun created SPARK-19748:


 Summary: refresh for InMemoryFileIndex with FileStatusCache does 
not work correctly
 Key: SPARK-19748
 URL: https://issues.apache.org/jira/browse/SPARK-19748
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we refresh an InMemoryFileIndex that has a FileStatusCache, it first uses the
FileStatusCache to generate cachedLeafFiles etc., and only then calls
FileStatusCache.invalidateAll. The order of these two actions is wrong, and it means
the refresh does not take effect.

{code}
  override def refresh(): Unit = {
refresh0()
fileStatusCache.invalidateAll()
  }

  private def refresh0(): Unit = {
val files = listLeafFiles(rootPaths)
cachedLeafFiles =
  new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => 
f.getPath -> f)
cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
cachedPartitionSpec = null
  }
{code}
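
A minimal sketch of the expected ordering, assuming the same members as the snippet above: invalidate the cache first, then re-list.

{code}
  override def refresh(): Unit = {
    // Drop stale cache entries first, so refresh0() below re-lists from the filesystem
    // instead of rebuilding cachedLeafFiles from the stale FileStatusCache contents.
    fileStatusCache.invalidateAll()
    refresh0()
  }
{code}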



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create a managed table with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Summary: create a managed table with an existed default location should 
throw an exception  (was: create managed table for hive tables with an existed 
default location should throw an exception)

> create a managed table with an existed default location should throw an 
> exception
> -
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after 
> [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)
> As we discussed in that [PR](https://github.com/apache/spark/pull/16938)
> The following DDL for a managed table with an existed default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> CREATE TABLE ... (PARTITIONED BY ...)
> {code}
> Currently there are some situations which are not consist with above logic:
> 1. CREATE TABLE ... (PARTITIONED BY ...) succeed with an existed default 
> location
> situation: for both hive/datasource(with HiveExternalCatalog/InMemoryCatalog)
> 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> situation: hive table succeed with an existed default location



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Description: 
This JIRA is a follow up work after 
[SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)

As we discussed in that [PR](https://github.com/apache/spark/pull/16938)

The following DDL for a managed table with an existed default location should 
throw an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
CREATE TABLE ... (PARTITIONED BY ...)
{code}
Currently there are some situations which are not consistent with the above logic:

1. CREATE TABLE ... (PARTITIONED BY ...) succeeds with an existing default location
situation: both Hive and datasource tables (with HiveExternalCatalog/InMemoryCatalog)

2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
situation: a Hive table succeeds with an existing default location


  was:
This JIRA is a follow up work after SPARK-19583

As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 

The following DDL for hive table with an existed default location should throw 
an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
{code}
Currently it will success for this situation


> create managed table for hive tables with an existed default location should 
> throw an exception
> ---
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after 
> [SPARK-19583](https://issues.apache.org/jira/browse/SPARK-19583)
> As we discussed in that [PR](https://github.com/apache/spark/pull/16938)
> The following DDL for a managed table with an existed default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> CREATE TABLE ... (PARTITIONED BY ...)
> {code}
> Currently there are some situations which are not consist with above logic:
> 1. CREATE TABLE ... (PARTITIONED BY ...) succeed with an existed default 
> location
> situation: for both hive/datasource(with HiveExternalCatalog/InMemoryCatalog)
> 2. CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> situation: hive table succeed with an existed default location



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19724) create managed table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19724:
-
Summary: create managed table for hive tables with an existed default 
location should throw an exception  (was: create table for hive tables with an 
existed default location should throw an exception)

> create managed table for hive tables with an existed default location should 
> throw an exception
> ---
>
> Key: SPARK-19724
> URL: https://issues.apache.org/jira/browse/SPARK-19724
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> This JIRA is a follow up work after SPARK-19583
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 
> The following DDL for hive table with an existed default location should 
> throw an exception:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
> {code}
> Currently it will success for this situation



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19724) create table for hive tables with an existed default location should throw an exception

2017-02-24 Thread Song Jun (JIRA)
Song Jun created SPARK-19724:


 Summary: create table for hive tables with an existed default 
location should throw an exception
 Key: SPARK-19724
 URL: https://issues.apache.org/jira/browse/SPARK-19724
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


This JIRA is a follow up work after SPARK-19583

As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 

The following DDL for a Hive table with an existing default location should throw
an exception:
{code}
CREATE TABLE ... (PARTITIONED BY ...) AS SELECT ...
{code}
Currently it succeeds in this situation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19723) create table for data source tables should work with an non-existent location

2017-02-23 Thread Song Jun (JIRA)
Song Jun created SPARK-19723:


 Summary: create table for data source tables should work with an 
non-existent location
 Key: SPARK-19723
 URL: https://issues.apache.org/jira/browse/SPARK-19723
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


This JIRA is a follow up work after SPARK-19583

As we discussed in that [PR|https://github.com/apache/spark/pull/16938] 

The following DDL for a datasource table with a non-existent location should
work:
{code}
CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
{code}
Currently it throws an exception that the path does not exist.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19667) Create table with HiveEnabled in default database use warehouse path instead of the location of default database

2017-02-20 Thread Song Jun (JIRA)
Song Jun created SPARK-19667:


 Summary: Create table with HiveEnabled in default database use 
warehouse path instead of the location of default database
 Key: SPARK-19667
 URL: https://issues.apache.org/jira/browse/SPARK-19667
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.1.0
Reporter: Song Jun


Currently, when we create a managed table in the default database with Hive enabled,
Spark uses the location of the default database as the table's location. This is fine
with a non-shared metastore.

However, if we use a metastore shared between different clusters, for example:
1) there is a Hive metastore in Cluster-A, the metastore uses a remote MySQL instance
as its db, and a default database is created in the metastore, so the location of the
default database is a path in Cluster-A

2) then we create another cluster, Cluster-B, which uses the same remote MySQL instance
for its metastore, so the default database configuration in Cluster-B is read from
MySQL and its location is still the path in Cluster-A

3) then we create a table in the default database from Cluster-B, and it throws an
UnknownHost exception for Cluster-A

In Hive 2.0.0 it is allowed to create a table in a default database shared between
clusters; this is not allowed for other databases, only for default.

As Spark users, we want the same behavior as Hive, so that we can create tables in the
default database.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place

2017-02-19 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19664:
-
Description: 
In SPARK-15959 we brought back 'hive.metastore.warehouse.dir'. In that logic, when the
value of 'spark.sql.warehouse.dir' is used to overwrite 'hive.metastore.warehouse.dir',
it is set on 'sparkContext.conf'; I think it should be put into
'sparkContext.hadoopConfiguration' and overwrite the original value in the hadoopConf.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64

  was:
In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in the 
logic, when use the value of  'spark.sql.warehouse.dir' to overwrite 
'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it 
should put in 'sparkContext.hadoopConfiguration'

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64


> put 'hive.metastore.warehouse.dir' in hadoopConf place
> --
>
> Key: SPARK-19664
> URL: https://issues.apache.org/jira/browse/SPARK-19664
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Song Jun
>Priority: Minor
>
> In SPARK-15959, we bring back the 'hive.metastore.warehouse.dir' , while in 
> the logic, when use the value of  'spark.sql.warehouse.dir' to overwrite 
> 'spark.sql.warehouse.dir' , it set it to 'sparkContext.conf' , I think it 
> should put in 'sparkContext.hadoopConfiguration' and overwrite the original 
> value of hadoopConf
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19664) put 'hive.metastore.warehouse.dir' in hadoopConf place

2017-02-19 Thread Song Jun (JIRA)
Song Jun created SPARK-19664:


 Summary: put 'hive.metastore.warehouse.dir' in hadoopConf place
 Key: SPARK-19664
 URL: https://issues.apache.org/jira/browse/SPARK-19664
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Song Jun
Priority: Minor


In SPARK-15959 we brought back 'hive.metastore.warehouse.dir'. In that logic, when the
value of 'spark.sql.warehouse.dir' is used to overwrite 'hive.metastore.warehouse.dir',
it is set on 'sparkContext.conf'; I think it should be put into
'sparkContext.hadoopConfiguration'.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L64
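
A minimal sketch of the suggested direction (illustrative only, not the actual SharedState code):

{code}
// Propagate the warehouse setting into the Hadoop configuration rather than only
// into the Spark conf, so Hive components that read hadoopConf see the override.
val warehousePath = sparkContext.conf.get("spark.sql.warehouse.dir")
sparkContext.hadoopConfiguration.set("hive.metastore.warehouse.dir", warehousePath)
{code}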



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869623#comment-15869623
 ] 

Song Jun commented on SPARK-19598:
--

Thanks~ let me investigate more~

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-15 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867548#comment-15867548
 ] 

Song Jun edited comment on SPARK-19598 at 2/15/17 9:46 AM:
---

[~rxin] While working on this JIRA, I found that it may not be appropriate to remove
UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, that is, use

{quote}SubqueryAlias(alias, UnresolvedRelation(tableIdentifier), None){quote}

to replace

{quote}UnresolvedRelation(tableIdentifier, alias){quote}

then, since there are lots of *match case* blocks for *UnresolvedRelation* whose
matched logic uses the alias parameter of *UnresolvedRelation*, a table with or without
an alias can currently be processed in one *match case UnresolvedRelation* branch, but
after this change we would have to process tables with and without an alias separately
in two *match case* branches:
{quote}
case u:UnresolvedRelation => func(u,None)
case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias)
{quote}
 
such as:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626

Is this right? Or am I missing something?


was (Author: windpiger):
[~rxin] When I do this jira, I found it that it is not proper to remove 
UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, 
that is:

{quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote}

to replace 

{quote}UnresolvedRelation(tableIdentifier, aliase){quote}

While there are lots of  *match case* codes for *UnresolvedRelation*,  and in 
matched logic it will use the alias parameter of *UnresolvedRelation*, 
currently table with/without alias can processed in one *match case 
UnresolvedRelation* logic, after this change, we should process table with 
alias and without alias seperately in two *match case*:
{quote}
case u:UnresolvedRelation => func(u,None)
case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias)
{quote}
 
such as:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626

Is this right? Or am I missing something?

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-15 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867548#comment-15867548
 ] 

Song Jun commented on SPARK-19598:
--

[~rxin] When I do this jira, I found it that it is not proper to remove 
UnresolvedRelation's alias argument. If we remove it and use SubqueryAlias, 
that is:

{quote}SubqueryAlias(alias,UnresolvedRelation(tableIdentifier),None){quote}

to replace 

{quote}UnresolvedRelation(tableIdentifier, aliase){quote}

While there are lots of  *match case* codes for *UnresolvedRelation*,  and in 
matched logic it will use the alias parameter of *UnresolvedRelation*, 
currently table with/without alias can processed in one *match case 
UnresolvedRelation* logic, after this change, we should process table with 
alias and without alias seperately in two *match case*:
{quote}
case u:UnresolvedRelation => func(u,None)
case s@SubqueryAlias(alias,u:UnresolvedRelation,_) => func(u, alias)
{quote}
 
such as:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L589-L591

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L626

Is this right? Or am I missing something?

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-14 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867398#comment-15867398
 ] 

Song Jun commented on SPARK-19598:
--

OK~ I'd like to do this. Thank you very much!

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix

2017-02-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun closed SPARK-19166.

Resolution: Not A Bug

minor issue

> change method name from 
> InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
> InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
> 
>
> Key: SPARK-19166
> URL: https://issues.apache.org/jira/browse/SPARK-19166
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Song Jun
>Priority: Minor
>
> InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files
> that match a static prefix, such as a partition file path (/table/foo=1), or a
> non-partition file path (/xxx/a.json),
> while the method name deleteMatchingPartitions indicates that only the
> partition files will be deleted. This name is confusing.
> It is better to rename the method.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog

2017-02-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun closed SPARK-19491.

Resolution: Duplicate

duplicate with https://github.com/apache/spark/pull/16736

> add a config for tableRelation cache size in SessionCatalog
> ---
>
> Key: SPARK-19491
> URL: https://issues.apache.org/jira/browse/SPARK-19491
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> currently the table relation cache size is hardcoded to 1000; it is better to
> add a config to set its size.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19484) continue work to create a table with an empty schema

2017-02-14 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun closed SPARK-19484.

Resolution: Won't Fix

this has been contained in https://github.com/apache/spark/pull/16787

> continue work to create a table with an empty schema
> 
>
> Key: SPARK-19484
> URL: https://issues.apache.org/jira/browse/SPARK-19484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Priority: Minor
>
> after SPARK-19279, we can no longer create a Hive table with an empty schema,
> so we should tighten up the condition when creating a hive table in
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835
> That is, if a CatalogTable t has an empty schema and there is no
> `spark.sql.schema.numParts` (or its value is 0), we should not add a default
> `col` schema; if we did, a table with an empty schema would be created, which
> is not what we expect.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19583) CTAS for data source tables with an created location does not work

2017-02-13 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865111#comment-15865111
 ] 

Song Jun commented on SPARK-19583:
--

ok, I'd like to take this one, thanks a lot!

> CTAS for data source tables with an created location does not work
> --
>
> Key: SPARK-19583
> URL: https://issues.apache.org/jira/browse/SPARK-19583
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>
> {noformat}
> spark.sql(
>   s"""
>  |CREATE TABLE t
>  |USING parquet
>  |PARTITIONED BY(a, b)
>  |LOCATION '$dir'
>  |AS SELECT 3 as a, 4 as b, 1 as c, 2 as d
>""".stripMargin)
> {noformat}
> Failed with the error message:
> {noformat}
> path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
> org.apache.spark.sql.AnalysisException: path 
> file:/private/var/folders/6r/15tqm8hn3ldb3rmbfqm1gf4cgn/T/spark-195cd513-428a-4df9-b196-87db0c73e772
>  already exists.;
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:102)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed

2017-02-13 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863511#comment-15863511
 ] 

Song Jun commented on SPARK-19577:
--

I am working on this~

> insert into a partition datasource table with InMemoryCatalog after the 
> partition location alter by alter command failed
> 
>
> Key: SPARK-19577
> URL: https://issues.apache.org/jira/browse/SPARK-19577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> If we use InMemoryCatalog, then we insert into a partition datasource table, 
> which partition location has changed by `alter table t partition(a="xx") set 
> location $newpath`, the insert operation is ok, and the data can be insert 
> into $newpath, while if we then select partition from the table, it will not 
> return the value we inserted.
> The reason is that the InMemoryFileIndex to inferPartition by the table's 
> rootPath, it does not track the user specific $newPath which provided by 
> alter command.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19575) Reading from or writing to a hive serde table with a non pre-existing location should succeed

2017-02-13 Thread Song Jun (JIRA)
Song Jun created SPARK-19575:


 Summary: Reading from or writing to a hive serde table with a non 
pre-existing location should succeed
 Key: SPARK-19575
 URL: https://issues.apache.org/jira/browse/SPARK-19575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, when we select from a Hive serde table which has a non pre-existing
location, it throws an exception:

{code}
Input path does not exist: file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
file:/tmp/spark-37caa4e6-5a6a-4361-a905-06cc56afb274
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:194)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2080)
at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:258)
{code}

This is a follow-up to SPARK-19329, which unified the behavior when reading from or
writing to a datasource table with a non pre-existing location; here we should also
unify the behavior for Hive serde tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19577) insert into a partition datasource table with InMemoryCatalog after the partition location alter by alter command failed

2017-02-13 Thread Song Jun (JIRA)
Song Jun created SPARK-19577:


 Summary: insert into a partition datasource table with 
InMemoryCatalog after the partition location alter by alter command failed
 Key: SPARK-19577
 URL: https://issues.apache.org/jira/browse/SPARK-19577
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we use InMemoryCatalog and insert into a partitioned datasource table whose
partition location has been changed by `alter table t partition(a="xx") set
location $newpath`, the insert operation succeeds and the data is written to
$newpath; but if we then select that partition from the table, it does not return
the value we inserted.

The reason is that InMemoryFileIndex infers partitions from the table's rootPath;
it does not track the user-specified $newpath provided by the alter command.
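
A hedged repro sketch (values and paths are hypothetical), assuming the InMemoryCatalog:

{code}
CREATE TABLE t(b INT, a STRING) USING parquet PARTITIONED BY (a);
INSERT INTO t PARTITION (a='xx') VALUES (1);
ALTER TABLE t PARTITION (a='xx') SET LOCATION '/tmp/newpath';
INSERT INTO t PARTITION (a='xx') VALUES (2);   -- the file lands under /tmp/newpath
SELECT * FROM t WHERE a = 'xx';                -- does not return the newly inserted row
{code}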



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19558) Provide a config option to attach QueryExecutionListener to SparkSession

2017-02-12 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862768#comment-15862768
 ] 

Song Jun commented on SPARK-19558:
--

sparkSession.listenerManager.register is not enough?

> Provide a config option to attach QueryExecutionListener to SparkSession
> 
>
> Key: SPARK-19558
> URL: https://issues.apache.org/jira/browse/SPARK-19558
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Salil Surendran
>
> Provide a configuration property(just like spark.extraListeners) to attach a 
> QueryExecutionListener to a SparkSession



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857570#comment-15857570
 ] 

Song Jun commented on SPARK-19496:
--

[~hyukjin.kwon]  

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', '-dd-MM')
> {code}
> will result to `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> this behavior is weird and we should check other systems like hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun edited comment on SPARK-19496 at 2/8/17 7:18 AM:
--

MySQL: `select str_to_date('2014-12-31','%Y-%d-%m')` also returns null.

That is, MySQL returns null both when the date is invalid and when the format is invalid.

Hive, on the other hand, transforms the invalid date into a valid one, e.g. 2014-31-12 -> 31/12 = 2 -> 2014+2 = 2016, 31 - 12*2 = 7 -> 2016-07-12.

Currently Spark can handle a wrong format / wrong date when to_date has the format parameter (like 
Hive's transformation). What about making to_date without the format parameter follow the same 
behavior, i.e. return the transformed date instead of null?


was (Author: windpiger):
mysql: select str_to_date('2014-12-31','%Y-%d-%m')   also return null

that is mysql both return null when the date is invalidate or the formate is 
invalidate.

and hive will transform the invalidate  date to  valid, e.g 2014-31-12 -> 31/12 
= 2 -> 2014+2=2016
, 31 - 12*2=7 -> 2016-07-12

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', '-dd-MM')
> {code}
> will result to `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> this behavior is weird and we should check other systems like hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun edited comment on SPARK-19496 at 2/8/17 7:11 AM:
--

MySQL: `select str_to_date('2014-12-31','%Y-%d-%m')` also returns null.

That is, MySQL returns null both when the date is invalid and when the format is invalid.

Hive, on the other hand, transforms the invalid date into a valid one, e.g. 2014-31-12 -> 31/12 = 2 -> 2014+2 = 2016, 31 - 12*2 = 7 -> 2016-07-12.


was (Author: windpiger):
mysql: select str_to_date('2014-12-31','%Y-%d-%m')   also return null

mysql both return null when the date is invalidate or the formate is invalidate

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', '-dd-MM')
> {code}
> will result to `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> this behavior is weird and we should check other systems like hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun edited comment on SPARK-19496 at 2/8/17 7:09 AM:
--

MySQL: `select str_to_date('2014-12-31','%Y-%d-%m')` also returns null.

That is, MySQL returns null both when the date is invalid and when the format is invalid.


was (Author: windpiger):
mysql: select str_to_date('2014-12-31','%Y-%d-%m')   also return null


> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', '-dd-MM')
> {code}
> will result to `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> this behavior is weird and we should check other systems like hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15857541#comment-15857541
 ] 

Song Jun commented on SPARK-19496:
--

MySQL: `select str_to_date('2014-12-31','%Y-%d-%m')` also returns null.


> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', '-dd-MM')
> {code}
> will result to `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> this behavior is weird and we should check other systems like hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19496) to_date with format has weird behavior

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856205#comment-15856205
 ] 

Song Jun commented on SPARK-19496:
--

I am working on this~

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, if we run
> {code}
> SELECT to_date('2015-07-22', '-dd-MM')
> {code}
> will result to `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   # default format
> {code}
> will return null.
> this behavior is weird and we should check other systems like hive to see if 
> this is expected.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19491) add a config for tableRelation cache size in SessionCatalog

2017-02-07 Thread Song Jun (JIRA)
Song Jun created SPARK-19491:


 Summary: add a config for tableRelation cache size in 
SessionCatalog
 Key: SPARK-19491
 URL: https://issues.apache.org/jira/browse/SPARK-19491
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently the table relation cache size is hardcoded to 1000; it would be better to add a config to 
set its size.
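
As a sketch, such a config entry could look roughly like this (the config key and its location are assumptions, not a final API):

{code}
import org.apache.spark.internal.config.ConfigBuilder

// hypothetical config key; 1000 keeps the current hard-coded behavior as the default
val TABLE_RELATION_CACHE_SIZE = ConfigBuilder("spark.sql.tableRelationCacheSize")
  .doc("Maximum number of entries kept in the SessionCatalog table relation cache.")
  .intConf
  .createWithDefault(1000)
{code}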



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-07 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855677#comment-15855677
 ] 

Song Jun commented on SPARK-19477:
--

thanks, I got it~

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19484) continue work to create a table with an empty schema

2017-02-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19484:


 Summary: continue work to create a table with an empty schema
 Key: SPARK-19484
 URL: https://issues.apache.org/jira/browse/SPARK-19484
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


After SPARK-19279, we can no longer create a Hive table with an empty schema, so we should tighten 
up the condition when creating a Hive table in

https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L835

That is, if a CatalogTable t has an empty schema and there is no `spark.sql.schema.numParts` 
property (or its value is 0), we should not add a default `col` schema; if we did, a table with an 
empty schema would be created, which is not what we expect.
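
A sketch of the tightened condition (simplified, assuming a CatalogTable named `table`; the property name is taken from the description above, and the exception type is illustrative):

{code}
// only fall back to the default `col` column when the schema really is stored in table properties
val schemaIsEmpty = table.schema.isEmpty
val hasNoSchemaParts = table.properties.get("spark.sql.schema.numParts").forall(_.toInt == 0)
if (schemaIsEmpty && hasNoSchemaParts) {
  // do not add a default `col` column; otherwise a table with an empty schema gets created
  throw new IllegalArgumentException(s"Cannot create hive table ${table.identifier} with an empty schema")
}
{code}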



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19430) Cannot read external tables with VARCHAR columns if they're backed by ORC files written by Hive 1.2.1

2017-02-05 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853466#comment-15853466
 ] 

Song Jun commented on SPARK-19430:
--

I think this is not a bug. If you want to access the Hive table, you can directly use
{code}
spark.table("orc_varchar_test").show
{code}

> Cannot read external tables with VARCHAR columns if they're backed by ORC 
> files written by Hive 1.2.1
> -
>
> Key: SPARK-19430
> URL: https://issues.apache.org/jira/browse/SPARK-19430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0
>Reporter: Sameer Agarwal
>
> Spark throws an exception when trying to read external tables with VARCHAR 
> columns if they're backed by ORC files that were written by Hive 1.2.1 (and 
> possibly other versions of hive).
> Steps to reproduce (credits to [~lian cheng]):
> # Write an ORC table using Hive 1.2.1 with
>{noformat}
> CREATE TABLE orc_varchar_test STORED AS ORC
> AS SELECT CASTE('a' AS VARCHAR(10)) AS c0{noformat}
> # Get the raw path of the written ORC file
> # Create an external table pointing to this file and read the table using 
> Spark
>   {noformat}
> val path = "/tmp/orc_varchar_test"
> sql(s"create external table if not exists test (c0 varchar(10)) stored as orc 
> location '$path'")
> spark.table("test").show(){noformat}
> The problem here is that the metadata in the ORC file written by Hive is 
> different from those written by Spark. We can inspect the ORC file written 
> above:
> {noformat}
> $ hive --orcfiledump 
> file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0
> Structure for 
> file:///Users/lian/local/var/lib/hive/warehouse_1.2.1/orc_varchar_test/00_0
> File Version: 0.12 with HIVE_8732
> Rows: 1
> Compression: ZLIB
> Compression size: 262144
> Type: struct<_col0:varchar(10)>   <
> ...
> {noformat}
> On the other hand, if you create an ORC table using the same DDL and inspect 
> the written ORC file, you'll see:
> {noformat}
> ...
> Type: struct
> ...
> {noformat}
> Note that all tests are done with {{spark.sql.hive.convertMetastoreOrc}} set 
> to {{false}}, which is the default case.
> I've verified that Spark 1.6.x, 2.0.x and 2.1.x all fail with instances of 
> the following error:
> {code}
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
> at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
> at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:529)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
> at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19447) Fix input metrics for range operator

2017-02-05 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853462#comment-15853462
 ] 

Song Jun commented on SPARK-19447:
--

spark.range(1,100).show

There is some information in the SQL UI like:
{noformat}
Range
number of output rows: 99
{noformat}

I didn't see anything like `0 rows`.

Maybe I didn't look in the right place. Could you help describe it more clearly?

Thanks!

> Fix input metrics for range operator
> 
>
> Key: SPARK-19447
> URL: https://issues.apache.org/jira/browse/SPARK-19447
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Reynold Xin
>
> Range operator currently does not output any input metrics, and as a result 
> in the SQL UI the number of rows shown is always 0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19463) refresh the table cache after InsertIntoHadoopFsRelation

2017-02-05 Thread Song Jun (JIRA)
Song Jun created SPARK-19463:


 Summary: refresh the table cache after InsertIntoHadoopFsRelation
 Key: SPARK-19463
 URL: https://issues.apache.org/jira/browse/SPARK-19463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


If we first cache a datasource table and then insert some data into it, we should refresh the 
cached data after the insert command.
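
A minimal sketch of the scenario (table name and data are illustrative):

{code}
spark.sql("CREATE TABLE t (a INT) USING parquet")
spark.catalog.cacheTable("t")
spark.sql("INSERT INTO TABLE t SELECT 1")
// after the insert, the cache should be refreshed so this shows the newly inserted row
spark.table("t").show()
{code}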



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19458) loading hive jars from the local repo which has already downloaded

2017-02-04 Thread Song Jun (JIRA)
Song Jun created SPARK-19458:


 Summary: loading hive jars from the local repo which has already 
downloaded
 Key: SPARK-19458
 URL: https://issues.apache.org/jira/browse/SPARK-19458
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


Currently, when we create a HiveClient for a specific metastore version and 
`spark.sql.hive.metastore.jars` is set to `maven`, Spark downloads the Hive jars from the remote 
repo (http://www.datanucleus.org/downloads/maven2).

We should allow the user to load the Hive jars from a local repo into which they have already been 
downloaded.
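
For context, the relevant settings today look roughly like this (values are illustrative):

{code}
spark.sql.hive.metastore.version  1.2.1
spark.sql.hive.metastore.jars     maven   # currently resolves against the remote repo
{code}

The proposal is to let this resolution first reuse jars already present in a local repository (for example ~/.m2/repository) instead of downloading them again.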



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19448) unify some duplication function in MetaStoreRelation

2017-02-03 Thread Song Jun (JIRA)
Song Jun created SPARK-19448:


 Summary: unify some duplication function in MetaStoreRelation
 Key: SPARK-19448
 URL: https://issues.apache.org/jira/browse/SPARK-19448
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun
Priority: Minor


1. MetaStoreRelation's hiveQlTable can be replaced by calling HiveClientImpl's toHiveTable
2. MetaStoreRelation's toHiveColumn can be replaced by calling HiveClientImpl's toHiveColumn
3. process another TODO:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19340) Opening a file in CSV format will result in an exception if the filename contains special characters

2017-01-25 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837407#comment-15837407
 ] 

Song Jun commented on SPARK-19340:
--

The reason is that Spark SQL treats test{00-1}.txt as a glob path.
We cannot put a file with a name like test{00-1}.txt on HDFS; it will throw an exception.

I think this is not a bug.

> Opening a file in CSV format will result in an exception if the filename 
> contains special characters
> 
>
> Key: SPARK-19340
> URL: https://issues.apache.org/jira/browse/SPARK-19340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0, 2.2.0
>Reporter: Reza Safi
>Priority: Minor
>
> If you want to open a file that its name is like  {noformat} "*{*}*.*" 
> {noformat} or {noformat} "*[*]*.*" {noformat} using CSV format, you will get 
> the "org.apache.spark.sql.AnalysisException: Path does not exist" whether the 
> file is a local file or on hdfs.
> This bug can be reproduced on master and all other Spark 2 branches.
> To reproduce:
> # Create a file like "test{00-1}.txt" on a local directory (like in 
> /Users/reza/test/test{00-1}.txt)
> # Run spark-shell
> # Execute this command:
> {noformat}
> val df=spark.read.option("header","false").csv("/Users/reza/test/*.txt")
> {noformat}
> You will see the following stack trace:
> {noformat}
> org.apache.spark.sql.AnalysisException: Path does not exist: 
> file:/Users/reza/test/test\{00-01\}.txt;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:360)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.readText(CSVFileFormat.scala:208)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:63)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:174)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:158)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:423)
>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:360)
>   ... 48 elided
> {noformat}
> If you put the file on hadoop (like on /user/root) when you try to run the 
> following:
> {noformat}
> val df=spark.read.option("header", false).csv("/user/root/*.txt")
> {noformat}
>  
> You will get the following exception:
> {noformat}
> org.apache.hadoop.mapred.InvalidInputException: Input Pattern 
> hdfs://hosturl/user/root/test\{00-01\}.txt matches 0 files
>   at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 

[jira] [Updated] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case

2017-01-24 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19359:
-
Issue Type: Improvement  (was: Bug)

> partition path created by Hive should be deleted after rename a partition 
> with upper-case
> -
>
> Key: SPARK-19359
> URL: https://issues.apache.org/jira/browse/SPARK-19359
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Song Jun
>Priority: Minor
>
> Hive metastore is not case preserving and keep partition columns with lower 
> case names. 
> If SparkSQL create a table with upper-case partion name use 
> HiveExternalCatalog, when we rename partition, it first call the HiveClient 
> to renamePartition, which will create a new lower case partition path, then 
> SparkSql rename the lower case path to the upper-case.
> while if the renamed partition contains more than one depth partition ,e.g. 
> A=1/B=2, hive renamePartition change to a=1/b=2, then SparkSql rename it to 
> A=1/B=2, but the a=1 still exists in the filesystem, we should also delete it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19359) partition path created by Hive should be deleted after rename a partition with upper-case

2017-01-24 Thread Song Jun (JIRA)
Song Jun created SPARK-19359:


 Summary: partition path created by Hive should be deleted after 
rename a partition with upper-case
 Key: SPARK-19359
 URL: https://issues.apache.org/jira/browse/SPARK-19359
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun
Priority: Minor


The Hive metastore is not case preserving and keeps partition columns with lower-case names.

If Spark SQL creates a table with upper-case partition names using HiveExternalCatalog, then when 
we rename a partition it first calls HiveClient.renamePartition, which creates a new lower-case 
partition path, and Spark SQL then renames that lower-case path to the upper-case one.

If the renamed partition contains more than one level of partitioning, e.g. A=1/B=2, Hive's 
renamePartition changes it to a=1/b=2 and Spark SQL then renames it to A=1/B=2, but the a=1 
directory still exists in the filesystem; we should delete it as well.
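
A reproduction sketch on a Hive serde table (column names and values are illustrative):

{code}
CREATE TABLE t (c INT) PARTITIONED BY (A INT, B INT);
ALTER TABLE t ADD PARTITION (A=1, B=2);
ALTER TABLE t PARTITION (A=1, B=2) RENAME TO PARTITION (A=3, B=4);
-- Hive first creates a=3/b=4, Spark then renames it to A=3/B=4,
-- but the intermediate a=3 directory is left behind and should be deleted
{code}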



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Song Jun (JIRA)
Song Jun created SPARK-19332:


 Summary: table's location should check if a URI is legal
 Key: SPARK-19332
 URL: https://issues.apache.org/jira/browse/SPARK-19332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to `URI`, but this 
has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat`
2. the type URI is fine for `CatalogTable`, but it is not appropriate for `CatalogTablePartition`
3. the location of a table partition can contain unencoded whitespace, and constructing a URI from 
such a location throws an exception; for example `/path/2014-01-01 00%3A00%3A00` is a partition 
location that contains whitespace

So if we change the type to URI, it breaks `CatalogTablePartition`.

I found that Hive has the same issue (HIVE-6185): before Hive 0.13 the location was a URI, and 
after that PR it was changed to a Path, with some checks done during DDL:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check on the table's location, and it is not appropriate to change the 
type to URI.
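
For illustration, an unencoded whitespace is enough to make java.net.URI parsing fail (the path is the example above):

{code}
import java.net.URI

new URI("/path/2014-01-01 00%3A00%3A00")    // throws URISyntaxException because of the unencoded space
new URI("/path/2014-01-01%2000%3A00%3A00")  // parses, but is not the string stored for the partition
{code}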




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19332) table's location should check if a URI is legal

2017-01-22 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19332:
-
Description: 
SPARK-19257's work is to change the type of `CatalogStorageFormat`'s locationUri to `URI`, but this 
has some problems:

1. `CatalogTable` and `CatalogTablePartition` use the same class `CatalogStorageFormat`
2. the type URI is fine for `CatalogTable`, but it is not appropriate for `CatalogTablePartition`
3. the location of a table partition can contain unencoded whitespace, and constructing a URI from 
such a location throws an exception; for example `/path/2014-01-01 00%3A00%3A00` is a partition 
location that contains whitespace

So if we change the type to URI, it breaks `CatalogTablePartition`.

I found that Hive has the same issue (HIVE-6185): before Hive 0.13 the location was a URI, and 
after that PR it was changed to a Path, with some checks done during DDL:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

So I think we can do the URI check on the table's location, and it is not appropriate to change the 
type to URI.


  was:
~SPARK-19257 ‘s work is to change the type of  `CatalogStorageFormat` 's 
locationUri to `URI`, while it has some problem:

1.`CatalogTable` and `CatalogTablePartition` use the same class 
`CatalogStorageFormat`
2. the type URI is ok for `CatalogTable`, but it is not proper for 
`CatalogTablePartition`
3. the location of a table partition can contains a not encode whitespace, so 
  if a partition location contains this not encode whitespace, and it will 
throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a 
partition location which has whitespace

so if we change the type to URI, it is bad for `CatalogTablePartition`

and I found Hive has the same issue ~HIVE-6185
before hive 0.13 the location is URI, while after above PR, it change it to 
Path, and do some check when DDL.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

so I think ,we can do the URI check for the table's location , and it is not 
proper to change the type to URI.



> table's location should check if a URI is legal
> ---
>
> Key: SPARK-19332
> URL: https://issues.apache.org/jira/browse/SPARK-19332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> SPARK-19257 ‘s work is to change the type of  `CatalogStorageFormat` 's 
> locationUri to `URI`, while it has some problem:
> 1.`CatalogTable` and `CatalogTablePartition` use the same class 
> `CatalogStorageFormat`
> 2. the type URI is ok for `CatalogTable`, but it is not proper for 
> `CatalogTablePartition`
> 3. the location of a table partition can contains a not encode whitespace, so 
>   if a partition location contains this not encode whitespace, and it will 
> throw an exception for URI. for example `/path/2014-01-01 00%3A00%3A00` is a 
> partition location which has whitespace
> so if we change the type to URI, it is bad for `CatalogTablePartition`
> and I found Hive has the same issue HIVE-6185
> before hive 0.13 the location is URI, while after above PR, it change it to 
> Path, and do some check when DDL.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732
> so I think ,we can do the URI check for the table's location , and it is not 
> proper to change the type to URI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-22 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 11:22 AM:


I found that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for 
`CatalogTablePartition`, because a table partition location can contain whitespace, such as 
`2014-01-01 00%3A00%3A00`, which is not a legal URI (the whitespace is not encoded).

a) should we separate the CatalogTableStorageFormat used by CatalogTablePartition from the one used 
by CatalogTable?

b) or should we just check whether the `locationUri: String` is illegal when executing a DDL 
command, such as `alter table set location xx` or `create table t options (path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

In Hive 0.12 the table's location method took a URI parameter; after that version it was changed to 
Path:
https://github.com/apache/hive/blob/branch-0.12/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java#L501

This is the same issue in Hive: 
https://issues.apache.org/jira/browse/HIVE-6185

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found it that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, 
this will failed in `CatalogTablePartition` because table partition location 
can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a 
legal URI (no encoded whitespace) 

a)should we seperate the CatalogTableStorageFormat from CatalogTablePartition 
and CatalogTable?

b)or we just check if the ` locationUri: String` is illegal when make a DDL 
command, such as ` alter table set location xx` or `create table t options 
(path xx) `

Hive's implement is like b) :
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19329) after alter a datasource table's location to a not exist location and then insert data throw Exception

2017-01-22 Thread Song Jun (JIRA)
Song Jun created SPARK-19329:


 Summary: after alter a datasource table's location to a not exist 
location and then insert data throw Exception
 Key: SPARK-19329
 URL: https://issues.apache.org/jira/browse/SPARK-19329
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


spark.sql("create table t(a string, b int) using parquet")
spark.sql(s"alter table t set location '$notexistedlocation'")
spark.sql("insert into table t select 'c', 1")

this will throw an exception:

com.google.common.util.concurrent.UncheckedExecutionException: 
org.apache.spark.sql.AnalysisException: Path does not exist: 
$notexistedlocation;
at 
com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
at 
org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at 
scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 6:41 AM:
---

I found that `CatalogPartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for 
`CatalogTablePartition`, because a table partition location can contain whitespace, such as 
`2014-01-01 00%3A00%3A00`, which is not a legal URI (the whitespace is not encoded).

a) should we separate the CatalogTableStorageFormat used by CatalogPartition from the one used by 
CatalogTable?

b) or should we just check whether the `locationUri: String` is illegal when executing a DDL 
command, such as `alter table set location xx` or `create table t options (path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found it that `CatalogPartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, 
this will failed in `CatalogTablePartition` because table partition location 
can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a 
legal URI (no encoded whitespace) 

should we seperate the CatalogTableStorageFormat from CatalogPartition and 
CatalogTable?

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 6:42 AM:
---

I found that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for 
`CatalogTablePartition`, because a table partition location can contain whitespace, such as 
`2014-01-01 00%3A00%3A00`, which is not a legal URI (the whitespace is not encoded).

a) should we separate the CatalogTableStorageFormat used by CatalogTablePartition from the one used 
by CatalogTable?

b) or should we just check whether the `locationUri: String` is illegal when executing a DDL 
command, such as `alter table set location xx` or `create table t options (path xx)`?

Hive's implementation is like b):
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found it that `CatalogPartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, 
this will failed in `CatalogTablePartition` because table partition location 
can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a 
legal URI (no encoded whitespace) 

a)should we seperate the CatalogTableStorageFormat from CatalogPartition and 
CatalogTable?

b)or we just check if the ` locationUri: String` is illegal when make a DDL 
command, such as ` alter table set location xx` or `create table t options 
(path xx) `

Hive's implement is like b) :
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java#L1553
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L3732

[~cloud_fan] [~smilegator]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/22/17 6:00 AM:
---

I found that `CatalogPartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for 
`CatalogTablePartition`, because a table partition location can contain whitespace, such as 
`2014-01-01 00%3A00%3A00`, which is not a legal URI (the whitespace is not encoded).

Should we separate the CatalogTableStorageFormat used by CatalogPartition from the one used by 
CatalogTable?

[~cloud_fan] [~smilegator]


was (Author: windpiger):
I found it that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, 
this will failed in `CatalogTablePartition` because table partition location 
can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a 
legal URI (no encoded whitespace) 

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun edited comment on SPARK-19257 at 1/21/17 11:09 AM:


I found that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for 
`CatalogTablePartition`, because a table partition location can contain whitespace, such as 
`2014-01-01 00%3A00%3A00`, which is not a legal URI (the whitespace is not encoded).


was (Author: windpiger):
I found it than `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
if we change CatalogTableStorageFormat 's locationUri from `String` to `URI`, 
this will failed in `CatalogTablePartition` because table partition location 
can have whitespace, such as `2014-01-01 00%3A00%3A00`, while this is not a 
legal URI (no encoded whitespace) 

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-21 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15832935#comment-15832935
 ] 

Song Jun commented on SPARK-19257:
--

I found that `CatalogTablePartition` and `CatalogTable` use the same class 
`CatalogTableStorageFormat`.
If we change CatalogTableStorageFormat's locationUri from `String` to `URI`, this will fail for 
`CatalogTablePartition`, because a table partition location can contain whitespace, such as 
`2014-01-01 00%3A00%3A00`, which is not a legal URI (the whitespace is not encoded).

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19284) append to a existed partitioned datasource table should have no CustomPartitionLocations

2017-01-19 Thread Song Jun (JIRA)
Song Jun created SPARK-19284:


 Summary: append to a existed partitioned datasource table should 
have no CustomPartitionLocations
 Key: SPARK-19284
 URL: https://issues.apache.org/jira/browse/SPARK-19284
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun
Priority: Minor


When we append data to an existing partitioned datasource table, 
InsertIntoHadoopFsRelationCommand.getCustomPartitionLocations currently returns the same location 
as the Hive default; it should return None.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-01-17 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825645#comment-15825645
 ] 

Song Jun commented on SPARK-19257:
--

I am working on this~ thanks

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> Currently we treat `CatalogStorageFormat.locationUri` as URI string and 
> always convert it to path by `new Path(new URI(locationUri))`
> It will be safer if we can make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check order

2017-01-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19246:
-
Summary: CataLogTable's partitionSchema should check order  (was: 
CataLogTable's partitionSchema should check if each column name in 
partitionColumnNames must match one and only one field in schema, and keep 
order with partitionColumnNames )

> CataLogTable's partitionSchema should check order
> ---
>
> Key: SPARK-19246
> URL: https://issues.apache.org/jira/browse/SPARK-19246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> get CataLogTable's partitionSchema should check if each column name in 
> partitionColumnNames must match one and only one field in schema, if not we 
> should throw an exception
> and CataLogTable's partitionSchema should keep order with 
> partitionColumnNames 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnN

2017-01-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19246:
-
Summary: CataLogTable's partitionSchema should check if each column name in 
partitionColumnNames must match one and only one field in schema, and keep 
order with partitionColumnNames   (was: CataLogTable's partitionSchema should 
check if each column name in partitionColumnNames must match one and only one 
field in schema)

> CataLogTable's partitionSchema should check if each column name in 
> partitionColumnNames must match one and only one field in schema, and keep 
> order with partitionColumnNames 
> --
>
> Key: SPARK-19246
> URL: https://issues.apache.org/jira/browse/SPARK-19246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> get CataLogTable's partitionSchema should check if each column name in 
> partitionColumnNames must match one and only one field in schema, if not we 
> should throw an exception



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema, and keep order with partitionColumnN

2017-01-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-19246:
-
Description: 
When getting CatalogTable's partitionSchema, we should check that each column name in 
partitionColumnNames matches one and only one field in the schema; if not, we 
should throw an exception.

In addition, CatalogTable's partitionSchema should keep the same order as partitionColumnNames.

  was:get CataLogTable's partitionSchema should check if each column name in 
partitionColumnNames must match one and only one field in schema, if not we 
should throw an exception


> CataLogTable's partitionSchema should check if each column name in 
> partitionColumnNames must match one and only one field in schema, and keep 
> order with partitionColumnNames 
> --
>
> Key: SPARK-19246
> URL: https://issues.apache.org/jira/browse/SPARK-19246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Song Jun
>
> When getting CatalogTable's partitionSchema, we should check that each column name in 
> partitionColumnNames matches one and only one field in the schema; if not, we 
> should throw an exception.
> In addition, CatalogTable's partitionSchema should keep the same order as 
> partitionColumnNames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19246) CataLogTable's partitionSchema should check if each column name in partitionColumnNames must match one and only one field in schema

2017-01-16 Thread Song Jun (JIRA)
Song Jun created SPARK-19246:


 Summary: CataLogTable's partitionSchema should check if each 
column name in partitionColumnNames must match one and only one field in schema
 Key: SPARK-19246
 URL: https://issues.apache.org/jira/browse/SPARK-19246
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Song Jun


When getting CatalogTable's partitionSchema, we should check that each column name in 
partitionColumnNames matches one and only one field in the schema; if not, we 
should throw an exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19153) DataFrameWriter.saveAsTable should work with hive format to create partitioned table

2017-01-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823657#comment-15823657
 ] 

Song Jun commented on SPARK-19153:
--

I am sorry, it is my fault; I forgot to comment.

> DataFrameWriter.saveAsTable should work with hive format to create 
> partitioned table
> 
>
> Key: SPARK-19153
> URL: https://issues.apache.org/jira/browse/SPARK-19153
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19241) remove hive generated table properties if they are not useful in Spark

2017-01-16 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15823654#comment-15823654
 ] 

Song Jun commented on SPARK-19241:
--

I am working on this~

> remove hive generated table properties if they are not useful in Spark
> --
>
> Key: SPARK-19241
> URL: https://issues.apache.org/jira/browse/SPARK-19241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we save a table into the Hive metastore, Hive will generate some table 
> properties automatically, e.g. transient_lastDdlTime, last_modified_by, 
> rawDataSize, etc. Some of them are useless in Spark SQL, and we should remove 
> them.
> It would be good if we could get the list of Hive-generated table properties via 
> the Hive API, so that we don't need to hardcode them.
> We can take a look at the Hive code to see how it excludes these auto-generated 
> table properties when describing a table.
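A hedged sketch of the filtering step; the property names below are commonly Hive-generated keys, but the exact exclusion list is an assumption here, which is exactly why reading it from a Hive API instead of hardcoding it would be preferable:

// Illustrative only: drop well-known Hive-generated table properties before
// Spark SQL stores or displays a table's properties.
val assumedHiveGeneratedProperties: Set[String] = Set(
  "transient_lastDdlTime", "last_modified_by", "last_modified_time",
  "numFiles", "numRows", "rawDataSize", "totalSize", "COLUMN_STATS_ACCURATE")

def stripHiveGeneratedProperties(props: Map[String, String]): Map[String, String] =
  props.filter { case (key, _) => !assumedHiveGeneratedProperties.contains(key) }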



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19166) change method name from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix

2017-01-10 Thread Song Jun (JIRA)
Song Jun created SPARK-19166:


 Summary: change method name from 
InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to 
InsertIntoHadoopFsRelationCommand.deleteMatchingPrefix
 Key: SPARK-19166
 URL: https://issues.apache.org/jira/browse/SPARK-19166
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Song Jun
Priority: Minor


InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions deletes all files 
that match a static prefix, such as a partition file path (/table/foo=1) or a 
non-partition file path (/xxx/a.json).

However, the method name deleteMatchingPartitions suggests that only partition 
files will be deleted, which is confusing.

It would be better to rename the method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19154) support read and overwrite a same table

2017-01-10 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15816942#comment-15816942
 ] 

Song Jun commented on SPARK-19154:
--

I am working on this~

> support read and overwrite a same table
> ---
>
> Key: SPARK-19154
> URL: https://issues.apache.org/jira/browse/SPARK-19154
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> In SPARK-5746, we forbid users to read and overwrite the same table. It seems 
> like we don't need this limitation now; we can remove the check and add 
> regression tests. We may need to take care of partitioned tables, though.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2016-12-28 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784678#comment-15784678
 ] 

Song Jun edited comment on SPARK-18930 at 12/29/16 6:44 AM:


From the Hive documentation, 
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert:

"Note that the dynamic partition values are selected by ordering, not name, and 
taken as the last columns from the select clause."

Testing this on Hive also shows the same behavior you describe.

I think we can close this JIRA? [~srowen]


was (Author: windpiger):
from hive document, 
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert

Note that the dynamic partition values are selected by ordering, not name, and 
taken as the last columns from the select clause.

and test it on hive also have the same logic as your description .

I think we can close this jira?

> Inserting in partitioned table - partitioned field should be last in select 
> statement. 
> ---
>
> Key: SPARK-18930
> URL: https://issues.apache.org/jira/browse/SPARK-18930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, 
> emp.db/test_partitioning_3/day=69094345
> As you can imagine, these numbers are the counts of records. But when I do select * 
> from temp.test_partitioning_4, the data is correct.
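To illustrate the ordering rule quoted above, here is a minimal sketch (assuming a SparkSession named spark and reusing the reporter's table and column names) of the insert with the dynamic partition column moved to the end of the select clause:

// Illustrative only: with dynamic partition insert, the partition value is
// taken from the last column of the select clause, so `day` must come last.
spark.sql(
  """
    |INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
    |SELECT count(*) AS num, day
    |FROM hss.session
    |WHERE year = 2016 AND month = 4
    |GROUP BY day
  """.stripMargin)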



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18930) Inserting in partitioned table - partitioned field should be last in select statement.

2016-12-28 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15784678#comment-15784678
 ] 

Song Jun commented on SPARK-18930:
--

From the Hive documentation, 
https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert:

"Note that the dynamic partition values are selected by ordering, not name, and 
taken as the last columns from the select clause."

Testing this on Hive also shows the same behavior you describe.

I think we can close this JIRA?

> Inserting in partitioned table - partitioned field should be last in select 
> statement. 
> ---
>
> Key: SPARK-18930
> URL: https://issues.apache.org/jira/browse/SPARK-18930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> CREATE TABLE temp.test_partitioning_4 (
>   num string
>  ) 
> PARTITIONED BY (
>   day string)
>   stored as parquet
> INSERT INTO TABLE temp.test_partitioning_4 PARTITION (day)
> select day, count(*) as num from 
> hss.session where year=2016 and month=4 
> group by day
> Resulting schema on HDFS: /temp.db/test_partitioning_3/day=62456298, 
> emp.db/test_partitioning_3/day=69094345
> As you can imagine, these numbers are the counts of records. But when I do select * 
> from temp.test_partitioning_4, the data is correct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported

2016-12-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18742:
-
Description: 
After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation of 
BroadcastFactory is TorrentBroadcastFactory, and the spark.broadcast.factory 
conf has been removed. 

However, the scaladoc still says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should update the comment to clarify that SparkContext will not use a 
user-specified BroadcastFactory implementation.

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30

  was:
After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. however
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should modify the comment that SparkContext will not use a  
user-specified BroadcastFactory implementation

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30


> Clarify that user-defined BroadcastFactory is not supported
> ---
>
> Key: SPARK-18742
> URL: https://issues.apache.org/jira/browse/SPARK-18742
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Song Jun
>Priority: Trivial
>
> After SPARK-12588 Remove HTTPBroadcast [1], the one and only implementation 
> of BroadcastFactory is TorrentBroadcastFactory, and the 
> spark.broadcast.factory conf has been removed. 
> However, the scaladoc still says [2]:
> /**
>  * An interface for all the broadcast implementations in Spark (to allow
>  * multiple broadcast implementations). SparkContext uses a user-specified
>  * BroadcastFactory implementation to instantiate a particular broadcast for 
> the
>  * entire Spark job.
>  */
> so we should update the comment to clarify that SparkContext will not use a 
> user-specified BroadcastFactory implementation.
> [1] https://issues.apache.org/jira/browse/SPARK-12588
> [2] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30
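A possible rewording of that scaladoc, offered only as a sketch of the clarification (not the wording actually merged):

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). Since SPARK-12588 removed HTTPBroadcast
 * and the spark.broadcast.factory configuration, SparkContext always uses the
 * built-in TorrentBroadcastFactory; user-specified BroadcastFactory
 * implementations are not supported.
 */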



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18742) Clarify that user-defined BroadcastFactory is not supported

2016-12-16 Thread Song Jun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Jun updated SPARK-18742:
-
Description: 
After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. however
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

so we should modify the comment that SparkContext will not use a  
user-specified BroadcastFactory implementation

[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30

  was:
After SPARK-12588 Remove HTTPBroadcast [1], the one and only
implementation of BroadcastFactory is TorrentBroadcastFactory. No code
in Spark 2 uses BroadcastFactory (but TorrentBroadcastFactory) however
the scaladoc says [2]:

/**
 * An interface for all the broadcast implementations in Spark (to allow
 * multiple broadcast implementations). SparkContext uses a user-specified
 * BroadcastFactory implementation to instantiate a particular broadcast for the
 * entire Spark job.
 */

which is not correct since there is no way to plug in a custom
user-specified BroadcastFactory.

It is better to readd spark.broadcast.factory for user-defined BroadcastFactory
 
[1] https://issues.apache.org/jira/browse/SPARK-12588
[2] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30


> Clarify that user-defined BroadcastFactory is not supported
> ---
>
> Key: SPARK-18742
> URL: https://issues.apache.org/jira/browse/SPARK-18742
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Song Jun
>Priority: Trivial
>
> After SPARK-12588 Remove HTTPBroadcast [1], the one and only
> implementation of BroadcastFactory is TorrentBroadcastFactory. However,
> the scaladoc still says [2]:
> /**
>  * An interface for all the broadcast implementations in Spark (to allow
>  * multiple broadcast implementations). SparkContext uses a user-specified
>  * BroadcastFactory implementation to instantiate a particular broadcast for 
> the
>  * entire Spark job.
>  */
> so we should update the comment to clarify that SparkContext will not use a 
> user-specified BroadcastFactory implementation.
> [1] https://issues.apache.org/jira/browse/SPARK-12588
> [2] 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/broadcast/BroadcastFactory.scala#L25-L30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


